Month: January 2016

# An overview of longitudinal data analysis

Longitudinal studies play an essential role in epidemiology, clinical research, and therapeutic evaluation. They are used to characterize normal growth and aging, to assess the effect of risk factors on human health, and to evaluate the effectiveness of treatments.

A longitudinal study is an investigation in which participant outcomes, and possibly treatments or exposures, are collected at multiple follow-up times. Each participant generally produces multiple or “repeated” measurements. These repeated measures data are correlated within subjects and therefore require special statistical techniques for valid analysis and inference. Another important outcome measured in a longitudinal study is the time until a key clinical event, such as disease recurrence or death. The analysis of such event-time endpoints is the focus of survival analysis.

Some advantages of longitudinal studies are the recording of incident events, prospective investigation of exposure, measurement of individual change in outcomes, separation of time effects (cohort, period, age), and control for cohort effects.

The exploratory analysis of longitudinal data tries to discover patterns of systematic variation across groups of patients, as well as aspects of random variation that distinguish individual patients. Summary statistics such as means and standard deviations, computed by group at each follow-up time, can reveal whether different groups are changing in a similar or different way.
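As a rough illustration of this kind of summary, the sketch below computes the mean and standard deviation of the outcome for each group at each visit. The records, group labels, and the function name `summarize_by_group_and_visit` are all made up for the example; real data would come from the study database.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical long-format records: (subject, group, visit, outcome).
records = [
    (1, "treated", 0, 10.0), (1, "treated", 1, 12.0), (1, "treated", 2, 14.0),
    (2, "treated", 0, 11.0), (2, "treated", 1, 14.0), (2, "treated", 2, 15.0),
    (3, "control", 0, 10.0), (3, "control", 1, 10.0), (3, "control", 2, 11.0),
    (4, "control", 0, 12.0), (4, "control", 1, 12.0), (4, "control", 2, 11.0),
]


def summarize_by_group_and_visit(rows):
    """Return {(group, visit): (mean, sd)} for the outcome values."""
    cells = defaultdict(list)
    for _subject, group, visit, outcome in rows:
        cells[(group, visit)].append(outcome)
    return {key: (mean(vals), stdev(vals)) for key, vals in sorted(cells.items())}


summary = summarize_by_group_and_visit(records)
for (group, visit), (m, s) in summary.items():
    print(f"{group:8s} visit {visit}: mean={m:.2f} sd={s:.2f}")
```

In this toy data the treated group's mean rises across visits while the control group's mean stays flat, which is the kind of pattern such a table is meant to surface.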

Regarding variation among individuals, the “distance” between measurements on different subjects is usually expected to be greater than the distance between repeated measurements taken on the same subject. Graphical methods can be used to explore the magnitude of variability in outcomes over time.
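A crude numerical counterpart to this idea is to compare the spread of the subject means (between-subject variation) with the average spread of each subject's own measurements (within-subject variation). The sketch below does this on invented data; it is only a descriptive check, not a formal variance-components estimate.

```python
from statistics import mean, variance

# Hypothetical repeated measurements, keyed by subject.
measurements = {
    "s1": [10.0, 11.0, 12.0],
    "s2": [20.0, 21.0, 22.0],
    "s3": [30.0, 31.0, 32.0],
}

# Between-subject variation: sample variance of the per-subject means.
subject_means = [mean(values) for values in measurements.values()]
between_var = variance(subject_means)

# Within-subject variation: average of the per-subject sample variances.
within_var = mean(variance(values) for values in measurements.values())

print(f"between-subject variance: {between_var:.1f}")
print(f"within-subject variance:  {within_var:.1f}")
```

Here the subjects differ from one another far more than any one subject varies over time, which matches the expectation described above.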

With correlated outcomes, it is important to understand both the strength of correlation and the pattern of correlations across time. Characterizing correlation helps in understanding the components of variation and in identifying a variance or correlation model for regression methods such as mixed-effects models or generalized estimating equations (GEE). These models will be discussed in another post.
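One simple way to look at the pattern of correlation is to arrange the data with one row per subject and one column per visit, then compute the correlation between every pair of visits. The sketch below does this with a hand-written Pearson correlation; the data and the helper names `pearson` and `correlation_matrix` are assumptions for illustration, and it presumes every subject was measured at every visit.

```python
from math import sqrt
from statistics import mean

# Hypothetical wide-format data: one row per subject, one column per visit.
wide = [
    [10.0, 12.0, 13.0],
    [14.0, 15.0, 18.0],
    [9.0, 10.0, 10.0],
    [12.0, 14.0, 13.0],
]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(rows):
    """Correlation between every pair of visit columns."""
    columns = list(zip(*rows))
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


corr = correlation_matrix(wide)
for row in corr:
    print("  ".join(f"{r:5.2f}" for r in row))
```

Reading along the rows of such a matrix shows whether correlation stays roughly constant or decays as visits get further apart, which can inform the choice of correlation model.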

Statistical inference with longitudinal data requires either that a univariate summary be created for each subject or that methods for correlated data be used. With the first approach, the most common summaries are the average response and the time slope. A second approach is a “pre-post” analysis, which analyzes a single follow-up response in conjunction with a baseline measurement.
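To make the per-subject summaries concrete, the sketch below reduces each subject's series to an average response, a least-squares time slope, and a simple change score (last value minus baseline) as one possible pre-post quantity. The data and the function name `time_slope` are invented for the example, and the change score is only one of several ways a pre-post analysis could use the baseline.

```python
from statistics import mean


def time_slope(times, values):
    """Ordinary least-squares slope of values regressed on times."""
    t_bar, y_bar = mean(times), mean(values)
    num = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


# Hypothetical data: subject -> (visit times, outcomes).
subjects = {
    "s1": ([0, 1, 2, 3], [10.0, 12.0, 14.0, 16.0]),
    "s2": ([0, 1, 2, 3], [11.0, 11.5, 12.0, 12.5]),
}

summaries = {
    sid: {
        "average": mean(y),
        "slope": time_slope(t, y),
        "pre_post_change": y[-1] - y[0],  # last follow-up minus baseline
    }
    for sid, (t, y) in subjects.items()
}

for sid, s in summaries.items():
    print(sid, s)
```

Once each subject contributes a single number, the summaries are independent across subjects and can be compared between groups with standard univariate methods.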

One of the major issues in the analysis of longitudinal data is missing data, which arise when subjects drop out of the study. It is assumed that once a participant drops out, they provide no further outcome information. Missing data can lead to biased estimates of means and/or regression parameters when the probability of missingness is associated with outcomes. For longitudinal data, a missing data classification is based on whether observed or unobserved outcomes are predictive of missing data (Laird 1988): Missing Completely at Random (MCAR), Missing at Random (MAR), and Non-Ignorable (NI). When data are MCAR, standard statistical summaries based on the observed data remain valid. However, if data are MAR or NI, then summaries based on the available cases may be biased. External information can help determine whether missingness mechanisms may be classified as MCAR, MAR, or NI. There are several statistical approaches that attempt to alleviate bias due to missing data.
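A small simulation can show why the missingness mechanism matters. The sketch below generates a baseline and a correlated follow-up outcome, then compares the available-case mean of the follow-up under two invented dropout rules: a coin flip (MCAR) and dropout whenever the observed baseline is below zero (a simple MAR mechanism). The sample size, distributions, and dropout rules are all assumptions chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20000

# Baseline outcome and a follow-up outcome correlated with it.
baseline = [rng.gauss(0.0, 1.0) for _ in range(n)]
follow_up = [y1 + rng.gauss(0.0, 1.0) for y1 in baseline]

# MCAR: each subject drops out with probability 0.5, unrelated to any outcome.
mcar_observed = [y2 for y2 in follow_up if rng.random() >= 0.5]

# MAR: subjects drop out when their observed baseline is below 0.
mar_observed = [y2 for y1, y2 in zip(baseline, follow_up) if y1 >= 0.0]

full_mean = mean(follow_up)
mcar_bias = mean(mcar_observed) - full_mean
mar_bias = mean(mar_observed) - full_mean

print(f"available-case bias under MCAR: {mcar_bias:+.3f}")
print(f"available-case bias under MAR:  {mar_bias:+.3f}")
```

Under the coin-flip rule the available-case mean stays close to the full-data mean, while under the baseline-dependent rule it is pulled upward because subjects with low baseline values, who also tend to have low follow-up values, are the ones missing.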

Texts by Diggle et al. [2002], Verbeke and Molenberghs [2000], Brown and Prescott [1999], and Crowder and Hand [1990] are good introductions to the analysis of longitudinal data.