Tag: Statistics

Survival Analysis – An introduction

Survival time refers to a variable that measures the time from a particular starting point (e.g., the time treatment was initiated) to a particular endpoint of interest (time-to-event).

In biomedical applications this is known as survival analysis, and the times may represent the survival time of a living organism or the time until a disease is cured.


These methods can also be applied to data from other areas, such as the social sciences (time to complete some task), economics (time spent looking for employment) and engineering (time to failure of some electronic component).

Areas of application:

– Clinical Trials (e.g., Recovery Time after heart surgery).

– Longitudinal or Cohort Studies (e.g., Time to observing the event of interest).

– Events may include death, injury, onset of illness, recovery from illness (binary variables) or transition above or below the clinical threshold of a meaningful continuous variable (e.g. CD4 counts).

Study Types:

– Clinical Studies

      Time origin = enrollment

      Time axis = time on study

      Right censoring common

– Epidemiological Studies

      Time axis = age

      Right censoring common

      Left truncation common

The main goals of survival analysis are to estimate time-to-event for a group of individuals, to compare time-to-event between two or more groups, and to assess the relationship of covariates to time-to-event. For example, do weight, insulin resistance, or cholesterol influence the survival time of MI (myocardial infarction) patients?

The distinguishing feature of survival analysis is that it incorporates censoring. Censoring occurs when we have some information about individual survival time, but we don’t know the time exactly.

Truncation of survival data occurs when only those individuals whose event time lies within a certain observational window (YL, YR) are observed; those who enter the study at time t are then a random sample of those in the population still at risk at t.

Truncation and Censoring:

– Truncation is about entering the study

      Right: Event has occurred (e.g. Cancer registry)

      Left: “staggered entry”

– Censoring is about leaving the study

      Right: Incomplete follow-up (common)

      Left: Observed time > survival time

– Independence is key.
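The difference between the two mechanisms can be seen in a small simulation. This is a minimal plain-Python sketch with invented parameters (event rate, an administrative censoring time, and a staggered-entry distribution): right censoring replaces the true time with min(T, C), while left truncation silently drops subjects whose event occurred before study entry.

```python
import random

random.seed(1)

# Hypothetical illustration: true event time T, administrative
# right-censoring time C, and a left-truncation (study-entry) time Y_L.
n = 10000
records = []
for _ in range(n):
    t = random.expovariate(0.2)   # true survival time (mean 5, invented rate)
    c = 8.0                       # everyone is censored at time 8
    y_l = random.uniform(0, 4)    # staggered entry into the study
    if t < y_l:
        continue                  # event before entry: subject never observed
    observed = min(t, c)          # what we actually record
    event = t <= c                # False => right-censored at last follow-up
    records.append((y_l, observed, event))
```

Note that the analysis data set contains fewer than `n` rows (the truncated subjects leave no trace at all), whereas the censored subjects are present but carry an incomplete time.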

The Kaplan-Meier method is a nonparametric technique and Cox regression a semiparametric one; both have wide applicability in survival analysis. They are appropriate when time-to-event data are analyzed as the outcome measure, and they are especially efficient when follow-up times vary, which is common in clinical research. The data needed are the time to event or last follow-up, the last known status (e.g. experienced the event, under follow-up, lost to follow-up, died) and explanatory or confounding variables (e.g. sex, age, type of glaucoma). Subjects who did not experience the event are “censored” at last follow-up. Censoring must be independent of the probability of experiencing the event, and the subject must remain at risk of the event after censoring. Cox regression additionally requires that the hazards be proportional (i.e. the hazard ratio is constant over time).

Kaplan-Meier analysis produces stepped curves which show the cumulative probability of experiencing the event as a function of time by study group. Groups can be compared using the log-rank test or an equivalent. Cox regression provides a numerical hazard ratio (e.g. the increased or decreased risk of the study group experiencing the event relative to the control group), adjusted for the effect of the other variables included in the regression model.
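In practice one would use a survival package (e.g. the survival package in R, or lifelines in Python), but the Kaplan-Meier estimator itself is simple enough to sketch by hand: at each distinct event time the running survival probability is multiplied by (1 - deaths / number at risk), and censored subjects simply leave the risk set. A minimal plain-Python sketch, with invented toy data:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.

    times  : follow-up time for each subject
    events : 1 if the event was observed, 0 if the subject was censored
    Returns a list of (event_time, survival_probability) steps.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        # count events (d) and censorings (c) at this time point
        d = sum(1 for tt, e in data if tt == t and e == 1)
        c = sum(1 for tt, e in data if tt == t and e == 0)
        if d > 0:
            surv *= 1 - d / at_risk
            curve.append((t, surv))
        at_risk -= d + c
        # advance past all records at time t
        while i < len(data) and data[i][0] == t:
            i += 1
    return curve

# toy data: 5 subjects, two of them censored (event = 0)
km = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

The censored subject at time 2 contributes to the risk set up to that point but causes no step in the curve, which is exactly how censoring enters the estimate.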

Data Science, Big Data and Statistics – can we all live together?

Terry Speed, Walter & Eliza Hall Institute of Medical Research in Melbourne, and emeritus professor of Statistics at the University of California, Berkeley.

The talk “Data Science, Big Data and Statistics – can we all live together?” is available from Chalmers Internal on Vimeo.

Time Series – Introduction to ARIMA Models

The ARIMA model, also known as the Box-Jenkins model or methodology, is commonly used in time-series analysis and forecasting. ARIMA is useful for forecasting under uncertainty because it does not assume knowledge of any underlying model or relationships, as some other methods do. It essentially relies on past values of the series, as well as previous error terms, for forecasting. ARIMA models are also relatively more robust and efficient than complex structural models for short-run forecasting.

ARIMA models are, in theory, the most general class of models for forecasting a time series which can be made to be “stationary” by differencing (if necessary), perhaps in conjunction with nonlinear transformations such as logging or deflating (if necessary). A random variable that is a time series is stationary if its statistical properties are all constant over time.  A stationary series has no trend, its variations around its mean have a constant amplitude, and it wiggles in a consistent fashion, i.e., its short-term random time patterns always look the same in a statistical sense.  The latter condition means that its autocorrelations (correlations with its own prior deviations from the mean) remain constant over time, or equivalently, that its power spectrum remains constant over time.  A random variable of this form can be viewed (as usual) as a combination of signal and noise, and the signal (if one is apparent) could be a pattern of fast or slow mean reversion, or sinusoidal oscillation, or rapid alternation in sign, and it could also have a seasonal component.  An ARIMA model can be viewed as a “filter” that tries to separate the signal from the noise, and the signal is then extrapolated into the future to obtain forecasts.
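The effect of differencing (the “I” in ARIMA) can be seen on a toy series with a linear trend: the raw series has a mean that drifts upward, while its first difference fluctuates around a constant. A minimal sketch with invented trend and noise parameters:

```python
import random

random.seed(42)

# series with a deterministic upward trend plus noise: clearly non-stationary
y = [0.5 * t + random.gauss(0, 1) for t in range(200)]

# first difference: dy_t = y_t - y_{t-1}, which removes the linear trend
dy = [y[t] - y[t - 1] for t in range(1, len(y))]

# compare the mean level of each half of the series before and after differencing
raw_shift = sum(y[100:]) / 100 - sum(y[:100]) / 100        # large: trend
diff_shift = sum(dy[100:]) / len(dy[100:]) - sum(dy[:100]) / 100  # near zero
```

Both halves of the differenced series hover around the trend slope (0.5 here), which is the sense in which differencing makes the series stationary in mean.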

ARIMA stands for Autoregressive Integrated Moving Average. ARIMA models are univariate (a single series). Their main application is short-term forecasting, which requires at least 40 historical data points. They work best when the data exhibit a stable or consistent pattern over time with a minimum of outliers. This methodology is usually superior to exponential smoothing techniques when the series is reasonably long and the correlation between past observations is stable. If the series is short or highly volatile, some smoothing method may perform better; if you do not have at least 38 data points, you should consider some other method. (B. G. Tabachnick and L. S. Fidell)
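In practice one would fit such models with a library (for example the ARIMA class in Python's statsmodels), but the autoregressive core, predicting the series from its own past values, can be sketched directly. Below, a hypothetical AR(1) series with coefficient 0.6 is simulated and the coefficient recovered by least squares; everything here (the coefficient, series length, noise) is invented for illustration:

```python
import random

def fit_ar1(series):
    """Least-squares estimate of phi in x_t = phi * x_{t-1} + e_t (zero-mean series)."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den

random.seed(0)
phi_true = 0.6
x = [0.0]
for _ in range(2000):
    x.append(phi_true * x[-1] + random.gauss(0, 1))

phi_hat = fit_ar1(x)
# one-step-ahead forecast: the fitted model's prediction for the next value
forecast = phi_hat * x[-1]
```

With a reasonably long series the estimate lands close to the true coefficient, which is why the rule of thumb above asks for a minimum number of historical data points before ARIMA is worthwhile.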

Choosing the Correct Statistical Test

Usually your data can be analyzed in multiple ways, each of which could yield legitimate answers. The table below covers a number of common analyses and helps you choose among them based on the number of dependent variables (sometimes referred to as outcome variables) and the nature of your independent variables (sometimes referred to as predictors). You also want to consider the nature of your dependent variable: whether it is an interval, ordinal or categorical variable, and whether it is (approximately) normally distributed.
