An overview of longitudinal data analysis

Longitudinal studies play an essential role in epidemiology, clinical research, and therapeutic evaluation. They are used to characterize normal growth and aging, to assess the effect of risk factors on human health, and to evaluate the effectiveness of treatments.

A longitudinal study is an investigation where participant outcomes and possibly treatments or exposures are collected at multiple follow-up times. Each participant generally produces multiple or “repeated” measurements. These repeated-measures data are correlated within subjects and require special statistical techniques for valid analysis and inference. Another important outcome measured in a longitudinal study is the time until a key clinical event such as disease recurrence or death. The analysis of such event-time endpoints is the focus of survival analysis.

The advantages of longitudinal studies include the recording of incident events, prospective investigation of exposure, measurement of individual change in outcomes, separation of time effects (cohort, period, age), and control for cohort effects.

The exploratory analysis of longitudinal data tries to discover patterns of systematic variation across groups of patients, as well as aspects of random variation that distinguish individual patients. Summary statistics such as means and standard deviations can reveal whether different groups are changing in a similar or different way.

Regarding variation among individuals, the “distance” between measurements on different subjects is usually expected to be greater than the distance between repeated measurements taken on the same subject. Graphical methods can be used to explore the magnitude of variability in outcomes over time.

With correlated outcomes it is important to understand the strength of correlation and the pattern of correlations across time. Characterizing correlation helps in understanding the components of variation and in identifying a variance or correlation model for regression methods such as mixed-effects models or generalized estimating equations (GEE). These models will be discussed in another post.
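As a rough illustration of how the pattern of correlation across time can be explored, the sketch below computes the correlation matrix between visits from data stored with one row per subject. The outcome values and the function names `pearson` and `visit_correlations` are hypothetical and chosen only for this example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def visit_correlations(wide):
    """Correlation matrix across visits; `wide` has one row per subject."""
    columns = list(zip(*wide))  # one tuple of outcomes per visit
    return [[pearson(a, b) for b in columns] for a in columns]

# Hypothetical outcomes: 5 subjects (rows) measured at 3 visits (columns).
outcomes = [
    [10.0, 11.0, 13.0],
    [12.0, 12.5, 13.5],
    [9.0, 10.5, 10.0],
    [14.0, 15.0, 17.0],
    [11.0, 11.5, 14.0],
]

corr = visit_correlations(outcomes)
for row in corr:
    print(["%.2f" % r for r in row])
```

Reading along the rows of such a matrix shows whether correlation stays roughly constant or weakens as visits get further apart, which is the kind of information that guides the choice of a correlation model.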

Statistical inference with longitudinal data requires either that a univariate summary be created for each subject or that methods for correlated data be used. The most common summaries are the average response and the time slope. A second approach is a “pre-post” analysis, which analyzes a single follow-up response in conjunction with a baseline measurement.
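The two summaries mentioned here, the average response and the time slope, can be computed per subject as in the minimal sketch below. The subject ids, visit times, and outcome values are hypothetical, and the slope is an ordinary least-squares slope of the outcome on time.

```python
def subject_summaries(times, responses):
    """Return (average response, least-squares time slope) for one subject."""
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(responses) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, responses))
    return mean_y, sxy / sxx

# Hypothetical data: subject id -> (visit times in months, outcomes).
data = {
    "A": ([0, 6, 12], [10.0, 11.0, 12.0]),
    "B": ([0, 6, 12], [12.0, 12.0, 12.0]),
    "C": ([0, 6, 12], [9.0, 10.5, 12.0]),
}

summaries = {sid: subject_summaries(t, y) for sid, (t, y) in data.items()}
for sid, (avg, slope) in summaries.items():
    print(sid, round(avg, 2), round(slope, 3))
```

Each subject is reduced to a single number per summary, so standard methods for independent observations can then be applied to the averages or to the slopes.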

One of the major issues in the analysis of longitudinal data is missing data, which arise when subjects drop out of the study. It is assumed that once a participant drops out they provide no further outcome information. Missing data can lead to biased estimates of means and/or regression parameters when the probability of missingness is associated with outcomes. For longitudinal data, a missing data classification is based on whether observed or unobserved outcomes are predictive of missing data (Laird 1988): Missing Completely at Random (MCAR), Missing at Random (MAR), and Non-Ignorable (NI). When data are MCAR, standard statistical summaries based on the observed data remain valid. However, if data are MAR or NI, summaries based on the available cases may be biased. External information can help determine whether missingness mechanisms may be classified as MCAR, MAR, or NI. There are several statistical approaches that attempt to alleviate bias due to missing data.
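A small, deliberately artificial example can show why available-case summaries may be biased when dropout depends on an observed value. The data and the dropout rule below are assumptions made up for illustration: subjects with a baseline above 8 are taken to miss the follow-up visit.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical (baseline, follow-up) pairs for 6 subjects with full data.
full = [(5.0, 6.0), (6.0, 7.0), (7.0, 8.0), (8.0, 9.0), (9.0, 10.0), (10.0, 11.0)]

# Assumed dropout rule: subjects with baseline above 8 miss the follow-up.
# Missingness depends only on an observed value (the baseline), which is
# a MAR-type mechanism rather than MCAR.
observed = [(b, f) for b, f in full if b <= 8.0]

true_mean = mean([f for _, f in full])                # uses all 6 subjects
available_case_mean = mean([f for _, f in observed])  # uses the 4 who stayed

print(true_mean, available_case_mean)
```

Because the subjects who drop out are those with the highest baselines, the mean of the remaining follow-up values understates the mean that would have been seen with complete data.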

Texts by Diggle et al. [2002], Verbeke and Molenberghs [2000], Brown and Prescott [1999], and Crowder and Hand [1990] are good introductions to the analysis of longitudinal data.


Survival Analysis – An introduction

Survival time refers to a variable that measures the time from a particular starting point (e.g., the time treatment was initiated) to a particular endpoint of interest (time-to-event).

In biomedical applications, the analysis of such data is known as survival analysis, and the times may represent the survival time of a living organism or the time until a disease is cured.


These methods can also be applied to data from other areas such as the social sciences (time to complete a task), economics (time spent looking for employment), and engineering (time to failure of an electronic component).

Areas of application:

– Clinical Trials (e.g., Recovery Time after heart surgery).

– Longitudinal or Cohort Studies (e.g., Time to observing the event of interest).

– Events may include death, injury, onset of illness, recovery from illness (binary variables) or transition above or below the clinical threshold of a meaningful continuous variable (e.g. CD4 counts).

Study Types:

– Clinical Studies

      Time origin = enrollment

      Time axis = time on study

      Right censoring common

– Epidemiological Studies

      Time axis = age

      Right censoring common

      Left truncation common

– Longitudinal or Cohort Studies (e.g., Time to observing the event of interest).

The main goals of survival analysis are to estimate time-to-event for a group of individuals, to compare time-to-event between two or more groups, and to assess the relationship of covariates to time-to-event. For example, do weight, insulin resistance, or cholesterol influence the survival time of MI patients?

The distinguishing feature of survival analysis is that it incorporates censoring. Censoring occurs when we have some information about individual survival time, but we don’t know the time exactly.

Truncation of survival data occurs when only those individuals whose event time lies within a certain observation window (YL, YR) are observed. Those who enter the study at time t are a random sample of those in the population still at risk at t.

Truncation and Censoring:

– Truncation is about entering the study

      Right: Event has occurred (e.g. Cancer registry)

      Left: “staggered entry”

– Censoring is about leaving the study

      Right: Incomplete follow-up (common)

      Left: Observed time > survival time

– Independence is key.

The Kaplan-Meier method and Cox regression are, respectively, nonparametric and semiparametric techniques with wide applicability in survival analysis. They are appropriate when time-to-event data are analyzed as the outcome measure. They are especially useful when follow-up times vary, which is common in clinical research. The data needed are time to event or last follow-up, last status (e.g. experienced the event, under follow-up, lost to follow-up, died) and explanatory or confounding variables (e.g. sex, age, type of glaucoma). Subjects who did not experience the event are “censored” at last follow-up. Censoring must be independent of the probability of experiencing the event, and the subject must remain at risk of the event after censoring. Cox regression additionally requires that the hazards be proportional (i.e. the hazard ratio is constant over time). Kaplan-Meier analysis produces stepped curves which show, by study group, the cumulative probability of remaining free of the event (or its complement, the cumulative probability of experiencing the event) as a function of time. Groups can be compared using the log-rank test or equivalent. Cox regression provides a numerical hazard ratio (e.g. increased or decreased risk of the study group to experience the event relative to the control group), which is adjusted for the effect of other variables included in the regression model.
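The Kaplan-Meier calculation can be sketched in a few lines of Python. The follow-up times and event indicators below are hypothetical, and `kaplan_meier` is an illustrative function written for this post, not a library API. Subjects censored at a given time are counted as still at risk at that time, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times  : follow-up time for each subject
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (event time, estimated survival probability).
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Everyone whose follow-up reaches t is still at risk at t.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
    return curve

# Hypothetical follow-up in months; 0 marks a censored subject.
times = [2, 3, 3, 5, 6, 8, 8, 10]
events = [1, 1, 0, 1, 0, 1, 1, 0]

km = kaplan_meier(times, events)
for t, s in km:
    print(t, round(s, 3))
```

The curve steps down only at times where an event was observed. Censored subjects do not cause a step, but they shrink the number at risk for later event times.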

Biostatistics – Types of epidemiological studies

Biostatistics is an innovative field that involves the design, analysis, and interpretation of data as applied to biological areas such as laboratory experiments, medical research (including clinical research), and health services. Biostatistics experts arrive at conclusions about disease and health risks by evaluating and applying mathematical and statistical formulas to the factors that impact health.

The types of epidemiological studies are represented in the figure.


For the evaluation of diagnostic tests, the figure shows how to calculate sensitivity, specificity, the positive predictive value, and the negative predictive value.

Other measures can also be calculated, such as the positive and negative likelihood ratios. The positive likelihood ratio is the probability of a positive test in people who have the disease divided by the probability of a positive test in people who do not have it. The negative likelihood ratio is the probability of a negative test in people who have the disease divided by the probability of a negative test in people who do not have it.
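These diagnostic measures all follow from the four cells of a 2x2 table of test result against true disease status. The sketch below uses the standard definitions. The counts are hypothetical, and the function name `diagnostic_metrics` is only an illustration.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Summary measures for a diagnostic test from a 2x2 table of counts.

    tp: diseased, test positive      fn: diseased, test negative
    fp: not diseased, test positive  tn: not diseased, test negative
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),  # positive predictive value
        "npv": tn / (tn + fn),  # negative predictive value
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }

# Hypothetical counts for 100 diseased and 200 non-diseased subjects.
metrics = diagnostic_metrics(tp=90, fp=30, fn=10, tn=170)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Unlike the predictive values, sensitivity, specificity, and the likelihood ratios do not depend on how common the disease is in the sample.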


These are some fundamental ideas and basic concepts for anyone who wants an introduction to this field of statistics.

Time Series – Introduction to ARIMA Models

The ARIMA model, also known as the Box-Jenkins model or methodology, is commonly used in analysis and forecasting. ARIMA is particularly useful for forecasting time series under uncertainty because it does not assume knowledge of any underlying model or relationships, as some other methods do. It essentially relies on past values of the series, as well as previous error terms, for forecasting. For short-run forecasting, ARIMA models are also relatively more robust and efficient than more complex structural models.
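To make the idea of forecasting from past values concrete, here is a minimal sketch of the simplest autoregressive case, an AR(1) model fitted by least squares on the mean-centred series. It is a simplified stand-in for a full ARIMA fit, not the Box-Jenkins procedure itself. The series and the function names `fit_ar1` and `forecast_ar1` are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] - m = phi * (y[t-1] - m) + e[t].

    Returns (m, phi), where m is the sample mean of the series.
    """
    m = sum(series) / len(series)
    centered = [y - m for y in series]
    num = sum(a * b for a, b in zip(centered, centered[1:]))
    den = sum(a * a for a in centered[:-1])
    return m, num / den

def forecast_ar1(series, steps):
    """Forecast `steps` values ahead by iterating the fitted AR(1) equation."""
    m, phi = fit_ar1(series)
    deviation = series[-1] - m
    forecasts = []
    for _ in range(steps):
        deviation = phi * deviation  # future error terms are set to zero
        forecasts.append(m + deviation)
    return forecasts

# Hypothetical stationary-looking series.
series = [10, 12, 11, 13, 12, 11, 10, 9, 10, 11]

m, phi = fit_ar1(series)
forecasts = forecast_ar1(series, 3)
print(round(phi, 3), [round(f, 3) for f in forecasts])
```

When the fitted coefficient lies between -1 and 1, the forecasts decay toward the series mean as the horizon grows, which is one reason such models are mainly used for short-run forecasting.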

ARIMA models are, in theory, the most general class of models for forecasting a time series which can be made to be “stationary” by differencing (if necessary), perhaps in conjunction with nonlinear transformations such as logging or deflating (if necessary). A random variable that is a time series is stationary if its statistical properties are all constant over time.  A stationary series has no trend, its variations around its mean have a constant amplitude, and it wiggles in a consistent fashion, i.e., its short-term random time patterns always look the same in a statistical sense.  The latter condition means that its autocorrelations (correlations with its own prior deviations from the mean) remain constant over time, or equivalently, that its power spectrum remains constant over time.  A random variable of this form can be viewed (as usual) as a combination of signal and noise, and the signal (if one is apparent) could be a pattern of fast or slow mean reversion, or sinusoidal oscillation, or rapid alternation in sign, and it could also have a seasonal component.  An ARIMA model can be viewed as a “filter” that tries to separate the signal from the noise, and the signal is then extrapolated into the future to obtain forecasts.
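Differencing, the "I" step in ARIMA, is easy to sketch. The example below is a minimal illustration with made-up series. It shows that one round of differencing removes a linear trend and two rounds remove a quadratic one. The function name `difference` is an illustrative choice.

```python
def difference(series, order=1):
    """Apply first differencing `order` times: y[t] - y[t-1]."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# Hypothetical series with a linear trend and with a quadratic trend.
linear = [3 * t + 5 for t in range(6)]
quadratic = [t * t for t in range(6)]

linear_diff = difference(linear)
quadratic_diff = difference(quadratic, order=2)

print(linear_diff)
print(quadratic_diff)
```

Each round of differencing shortens the series by one observation, so heavily differenced short series leave little data for fitting.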

ARIMA stands for Autoregressive Integrated Moving Average. The models are univariate (a single vector of observations). Their main application is short-term forecasting, requiring at least 40 historical data points. ARIMA works best when your data exhibit a stable or consistent pattern over time with a minimum of outliers. This methodology is usually superior to exponential smoothing techniques when the series is reasonably long and the correlation between past observations is stable. If the series is short or highly volatile, then some smoothing method may perform better. If you do not have at least 38 data points, you should consider some other method. (B. G. Tabachnick and L. S. Fidell)

Sampling techniques

Information on characteristics of populations is constantly needed by politicians, marketing departments of companies, public officials responsible for planning health and social services, and others. For reasons relating to timeliness and cost, this information is often obtained by use of sample surveys.

A sample survey may be defined as a study involving a subset (or sample) of individuals selected from a larger population. Variables or characteristics of interest are observed or measured on each of the sampled individuals. These measurements are then aggregated over all individuals in the sample to obtain summary statistics (e.g., means, proportions, and totals) for the sample. It is from these summary statistics that extrapolations can be made concerning the entire population. The validity and reliability of these extrapolations depend on how well the sample was chosen and on how well the measurements were made.

Sample surveys belong to a larger class of nonexperimental studies generally given the name “observational studies” in the health or social sciences literature. Most sample surveys can be put in the class of observational studies known as “cross-sectional studies.” Other types of observational studies include cohort studies and case-control studies. Cross-sectional studies are “snapshots” of a population at a single point in time, having as objectives either the estimation of the prevalence or the mean level of some characteristics of the population or the measurement of the relationship between two or more variables measured at the same point in time. Cohort and case-control studies are used for analytic rather than for descriptive purposes. For example, they are used in epidemiology to test hypotheses about the association between exposure to suspected risk factors and the incidence of specific diseases. (Levy, P. & Lemeshow, S.)

It is incumbent on the researcher to clearly define the target population. There are no strict rules to follow, and the researcher must rely on logic and judgment. The population is defined in keeping with the objectives of the study.

Usually, the population is too large for the researcher to attempt to survey all of its members. A small, but carefully chosen sample can be used to represent the population. The sample reflects the characteristics of the population from which it is drawn.

Sampling methods are classified as either probability or nonprobability. In probability samples, each member of the population has a known non-zero probability of being selected. Probability methods include simple random, systematic random, stratified and multi-stage cluster sampling. In nonprobability sampling, members are selected from the population in some nonrandom manner. These include convenience, snowball, quota and judgment sampling. The advantage of probability sampling is that sampling error can be calculated. Sampling error is the degree to which a sample might differ from the population. When inferring to the population, results are reported plus or minus the sampling error. In nonprobability sampling, the degree to which the sample differs from the population remains unknown.
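Three of the probability methods named above can be sketched with Python's standard `random` module. The population, the strata, and all function names are hypothetical. The standard error shown is the usual estimate for a simple random sample and ignores the finite population correction.

```python
import random
from math import sqrt

def simple_random_sample(population, n, rng):
    """Draw n distinct units, each subset of size n being equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Every k-th unit after a random start, with k = len(population) // n."""
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, fraction, rng):
    """Simple random sample of the same fraction within each stratum."""
    sample = []
    for units in strata.values():
        sample.extend(rng.sample(units, round(len(units) * fraction)))
    return sample

def standard_error_of_mean(values):
    """Estimated standard error of the sample mean (no finite population correction)."""
    n = len(values)
    m = sum(values) / n
    variance = sum((v - m) ** 2 for v in values) / (n - 1)
    return sqrt(variance / n)

rng = random.Random(42)           # fixed seed so the draws are reproducible
population = list(range(1, 101))  # hypothetical measurements on 100 units
strata = {"urban": population[:60], "rural": population[60:]}

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)
strat = stratified_sample(strata, 0.1, rng)
se = standard_error_of_mean(srs)

print(sorted(srs), sys_sample, sorted(strat), round(se, 2))
```

Because the selection probabilities are known in each case, a sampling error such as `se` can be attached to the estimate, which is not possible for convenience or judgment samples.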

Choosing the Correct Statistical Test

Usually your data could be analyzed in multiple ways, each of which could yield legitimate answers. The table below covers a number of common analyses and helps you choose among them based on the number of dependent variables (sometimes referred to as outcome variables) and the nature of your independent variables (sometimes referred to as predictors). You also want to consider the nature of your dependent variable, namely whether it is an interval, ordinal, or categorical variable, and whether it is (approximately) normally distributed.
