Month: October 2015

Biostatistics – Types of epidemiological studies

Biostatistics is an innovative field that involves the design, analysis, and interpretation of data as applied to biological areas: biological laboratory experiments, medical research (including clinical research), and health services. Biostatisticians draw conclusions about disease and health risks by applying mathematical and statistical methods to the factors that affect health.

The types of epidemiological studies are represented in the figure.


For the evaluation of diagnostic tests, the figure shows how to calculate the sensitivity, the specificity, the positive predictive value and the negative predictive value.
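As a minimal sketch of these four calculations, the Python snippet below works from the usual 2x2 table of test result against true disease status. The function name `diagnostic_metrics` and the counts are hypothetical and chosen only for illustration.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV and NPV from a 2x2 table.

    tp: diseased, test positive      fp: not diseased, test positive
    fn: diseased, test negative      tn: not diseased, test negative
    """
    return {
        "sensitivity": tp / (tp + fn),  # P(test positive | disease)
        "specificity": tn / (tn + fp),  # P(test negative | no disease)
        "ppv": tp / (tp + fp),          # P(disease | test positive)
        "npv": tn / (tn + fn),          # P(no disease | test negative)
    }


# Hypothetical counts, not taken from any real study.
metrics = diagnostic_metrics(tp=90, fp=30, fn=10, tn=170)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the predictive values depend on how common the disease is in the tested group, while sensitivity and specificity do not.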

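Other measures can also be calculated, such as the positive and negative likelihood ratios. The positive likelihood ratio is the probability that a person with the disease tests positive divided by the probability that a person without the disease tests positive. The negative likelihood ratio is the probability that a person with the disease tests negative divided by the probability that a person without the disease tests negative.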
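Using the standard definitions, the positive likelihood ratio is sensitivity / (1 - specificity) and the negative likelihood ratio is (1 - sensitivity) / specificity. A short Python sketch follows; the function name and the input values are hypothetical.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a diagnostic test.

    LR+ = P(test positive | disease) / P(test positive | no disease)
    LR- = P(test negative | disease) / P(test negative | no disease)
    """
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


# Illustrative values only.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.85)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

A positive likelihood ratio well above 1 means a positive result makes the disease more likely; a negative likelihood ratio close to 0 means a negative result makes it much less likely.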


These are some fundamental ideas and basic concepts for anyone who wants an introduction to this field of statistics.

Data Science, Big Data and Statistics – can we all live together?

Terry Speed, Walter & Eliza Hall Institute of Medical Research in Melbourne, and emeritus professor in Statistics at University of California at Berkeley.

Data Science, Big Data and Statistics – can we all live together? from Chalmers Internal on Vimeo.

Time Series – Introduction to ARIMA Models

The ARIMA model, also known as the Box-Jenkins model or methodology, is commonly used in time series analysis and forecasting. ARIMA is particularly useful for forecasting under uncertainty because it does not assume knowledge of any underlying model or relationships, as some other methods do. It essentially relies on past values of the series as well as previous error terms for forecasting. ARIMA models are also relatively more robust and efficient than more complex structural models for short-run forecasting.

ARIMA models are, in theory, the most general class of models for forecasting a time series which can be made to be “stationary” by differencing (if necessary), perhaps in conjunction with nonlinear transformations such as logging or deflating (if necessary). A random variable that is a time series is stationary if its statistical properties are all constant over time.  A stationary series has no trend, its variations around its mean have a constant amplitude, and it wiggles in a consistent fashion, i.e., its short-term random time patterns always look the same in a statistical sense.  The latter condition means that its autocorrelations (correlations with its own prior deviations from the mean) remain constant over time, or equivalently, that its power spectrum remains constant over time.  A random variable of this form can be viewed (as usual) as a combination of signal and noise, and the signal (if one is apparent) could be a pattern of fast or slow mean reversion, or sinusoidal oscillation, or rapid alternation in sign, and it could also have a seasonal component.  An ARIMA model can be viewed as a “filter” that tries to separate the signal from the noise, and the signal is then extrapolated into the future to obtain forecasts.
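To make the idea of differencing concrete, here is a small Python sketch. It builds a synthetic random walk with drift (an assumption for illustration, not data from any source), which has a trend and is therefore not stationary, and then takes first differences. The helper names `difference` and `mean` are hypothetical.

```python
import random


def difference(series, lag=1):
    """Return the lag-differenced series: series[t] - series[t - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def mean(values):
    return sum(values) / len(values)


random.seed(0)
drift = 0.5

# Synthetic random walk with drift: each value is the previous one
# plus a constant drift and Gaussian noise, so the level keeps rising.
series = [0.0]
for _ in range(199):
    series.append(series[-1] + drift + random.gauss(0, 1))

diffed = difference(series)

# The raw series has very different means in its two halves (a trend);
# the differenced series fluctuates around the drift in both halves.
half = len(series) // 2
print("raw series :", mean(series[:half]), mean(series[half:]))
print("differenced:", mean(diffed[:half]), mean(diffed[half:]))
```

If one round of differencing is not enough to remove the trend, the same function can be applied again to the differenced series.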

ARIMA stands for Autoregressive Integrated Moving Average. The models are univariate: they work on a single series. Their main application is short-term forecasting, which requires roughly 40 or more historical data points. ARIMA works best when your data exhibits a stable or consistent pattern over time with a minimum number of outliers. This methodology is usually superior to exponential smoothing techniques when the series is reasonably long and the correlation between past observations is stable. If the series is short or highly volatile, some smoothing method may perform better. If you have fewer than about 40 data points, you should consider some other method. (B. G. Tabachnick and L. S. Fidell)
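The "autoregressive" and "integrated" parts of the name can be illustrated with a deliberately small sketch: difference the series once, fit an AR(1) model to the differences by ordinary least squares, then add the forecast differences back onto the last observed level. This corresponds to an ARIMA(1,1,0) model with a constant; it leaves out the moving-average part and is not how a statistics package estimates these models. The function names and the toy series are hypothetical.

```python
def fit_ar1(x):
    """Least-squares fit of x[t] = const + phi * x[t-1] + error."""
    prev, curr = x[:-1], x[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    const = mean_curr - phi * mean_prev
    return const, phi


def forecast_arima_110(series, steps):
    """Forecast `steps` values ahead with a simple ARIMA(1,1,0) sketch."""
    # "Integrated": model the first differences instead of the levels.
    diffs = [b - a for a, b in zip(series, series[1:])]
    # "Autoregressive": each difference depends on the previous one.
    const, phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = const + phi * last_diff
        level += last_diff  # undo the differencing
        forecasts.append(level)
    return forecasts


# Toy series whose differences follow d[t] = 1 + 0.5 * d[t-1] exactly,
# so the fitted coefficients can be checked by hand.
toy_series = [0.0, 4.0, 7.0, 9.5, 11.75, 13.875]
forecasts = forecast_arima_110(toy_series, steps=2)
print(forecasts)
```

The toy series is far shorter than anything you would model in practice; it is kept tiny only so the arithmetic can be followed by hand.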

Sampling techniques

Information on characteristics of populations is constantly needed by politicians, marketing departments of companies, public officials responsible for planning health and social services, and others. For reasons relating to timeliness and cost, this information is often obtained by use of sample surveys.

A sample survey may be defined as a study involving a subset (or sample) of individuals selected from a larger population. Variables or characteristics of interest are observed or measured on each of the sampled individuals. These measurements are then aggregated over all individuals in the sample to obtain summary statistics (e.g., means, proportions, and totals) for the sample. It is from these summary statistics that extrapolations can be made concerning the entire population. The validity and reliability of these extrapolations depend on how well the sample was chosen and on how well the measurements were made.
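The aggregation step can be sketched in a few lines of Python. The records, field names and population size below are hypothetical, and extrapolating a total as the population size times the sample mean assumes a simple random sample in which every individual had the same chance of selection.

```python
# Hypothetical survey records: one dict per sampled individual.
sample = [
    {"visits": 2, "insured": True},
    {"visits": 0, "insured": False},
    {"visits": 5, "insured": True},
    {"visits": 1, "insured": True},
]
population_size = 1000  # assumed known size of the target population

n = len(sample)

# Summary statistics for the sample itself.
mean_visits = sum(r["visits"] for r in sample) / n
proportion_insured = sum(1 for r in sample if r["insured"]) / n

# Extrapolation to the population (valid under simple random sampling).
estimated_total_visits = population_size * mean_visits

print(f"mean visits per person: {mean_visits:.2f}")
print(f"proportion insured:     {proportion_insured:.2f}")
print(f"estimated total visits: {estimated_total_visits:.0f}")
```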

Sample surveys belong to a larger class of nonexperimental studies generally given the name “observational studies” in the health or social sciences literature. Most sample surveys can be put in the class of observational studies known as “cross-sectional studies.” Other types of observational studies include cohort studies and case-control studies. Cross-sectional studies are “snapshots” of a population at a single point in time, having as objectives either the estimation of the prevalence or the mean level of some characteristics of the population or the measurement of the relationship between two or more variables measured at the same point in time. Cohort and case-control studies are used for analytic rather than for descriptive purposes. For example, they are used in epidemiology to test hypotheses about the association between exposure to suspected risk factors and the incidence of specific diseases. (Levy, P. & Lemeshow, S.)

It is incumbent on the researcher to clearly define the target population. There are no strict rules to follow, and the researcher must rely on logic and judgment. The population is defined in keeping with the objectives of the study.

Usually, the population is too large for the researcher to attempt to survey all of its members. A small, but carefully chosen sample can be used to represent the population. The sample reflects the characteristics of the population from which it is drawn.

Sampling methods are classified as either probability or nonprobability. In probability samples, each member of the population has a known non-zero probability of being selected. Probability methods include simple random, systematic random, stratified and multi-stage cluster sampling. In nonprobability sampling, members are selected from the population in some nonrandom manner. These include convenience, snowball, quota and judgment sampling. The advantage of probability sampling is that sampling error can be calculated. Sampling error is the degree to which a sample might differ from the population. When inferring to the population, results are reported plus or minus the sampling error. In nonprobability sampling, the degree to which the sample differs from the population remains unknown.
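Three of the probability methods, and one common way of expressing sampling error, can be sketched with Python's standard library. All function names are hypothetical, the population is just the integers 0 to 999, and the margin of error uses the usual normal approximation for a proportion from a simple random sample (z = 1.96 for roughly 95% confidence).

```python
import math
import random


def simple_random_sample(population, n, rng):
    """Every subset of size n is equally likely."""
    return rng.sample(population, n)


def systematic_sample(population, n, rng):
    """Pick a random start, then take every `step`-th unit."""
    step = len(population) // n
    start = rng.randrange(step)
    return [population[start + i * step] for i in range(n)]


def stratified_sample(population, stratum_of, fraction, rng):
    """Draw the same fraction at random from each stratum."""
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


def margin_of_error(p_hat, n, z=1.96):
    """Approximate sampling error of a proportion (normal approximation)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


rng = random.Random(42)
population = list(range(1000))  # stand-in for a list of population members

srs = simple_random_sample(population, 50, rng)
systematic = systematic_sample(population, 50, rng)
# Hypothetical strata: four groups defined by the remainder modulo 4.
stratified = stratified_sample(population, lambda u: u % 4, 0.1, rng)

print(len(srs), len(systematic), len(stratified))
print(f"50% +/- {100 * margin_of_error(0.5, 400):.1f} points for n = 400")
```

No such calculation is available for a convenience or snowball sample, because the selection probabilities behind it are unknown.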