**Statistical Approaches for Analyzing a Continuous Outcome in Experimental Studies**

**Métodos Estadísticos para Analizar un Resultado Continuo en Estudios Experimentales**

**Antonio Sanhueza*,**; Tamara Otzen**,***; Carlos Manterola****,***** & Nelson Araneda********

* Department of Mathematics and Statistics, Universidad de La Frontera, Temuco, Chile.

** Department of Psychology, Universidad Autónoma de Chile, Temuco, Chile.

*** Ph.D. Program in Medical Sciences, Universidad de La Frontera, Temuco, Chile.

**** Department of Surgery and Traumatology, Universidad de La Frontera, Temuco, Chile.

***** Centro de Investigación en Ciencias Biomédicas, Universidad Autónoma de Chile, Temuco, Chile.

****** Department of Education, Universidad de La Frontera, Temuco, Chile.

**SUMMARY**: In experimental studies the outcome variable is measured at an initial time, usually called "baseline", and then at one or more later times called "follow-up" measurements. The question of interest in an experimental study is whether there is a significant difference in effect between the treatment group and the comparison group after the intervention. In addition, one wants to estimate the size of this difference. This paper examines some of the strategies, including a simulation process, that can be used for analyzing data from such a study, and considers whether or not to use the baseline measurements. Three parametric and two non-parametric strategies are evaluated considering only one follow-up measurement. The baseline measurement is incorporated in the context of these strategies.

**KEY WORDS: Experimental Studies; Follow-Up Studies; Biostatistics; Statistics, Nonparametric.**

**RESUMEN**: En estudios experimentales, la variable resultado se mide en el momento inicial y luego en diversas ocasiones. De este modo, se habla de mediciones de "línea de base" y seguimiento respectivamente. Lo interesante de esta materia es poder determinar si una vez aplicada una intervención, existen diferencias significativas entre el grupo al que se asignó un tratamiento de prueba y el grupo de comparación. En este manuscrito se exponen algunas de las estrategias utilizadas para tal propósito; las que incluyen un proceso de simulación mediante datos obtenidos a partir de un estudio experimental. Tres estrategias paramétricas y dos no paramétricas se evalúan teniendo en cuenta sólo una medida de seguimiento. La medida de referencia se incorpora en el contexto de estas estrategias.

**PALABRAS CLAVE: Estudio experimental; Estudios de seguimiento; Bioestadística; Estadísticas no-paramétricas.**


**INTRODUCTION**

In experimental studies (ES), the term "baseline" is used for the measurements taken on a participant before the start of the intervention. These measurements are the basis for characterizing and describing the study population. In addition, the investigators compare the distributions of baseline characteristics in the treatment group with those in the comparison group. In ES, if randomization worked, one expects no meaningful differences in these characteristics between the groups. However, if there are large differences, the randomization can be called into question (Friedman et al., 1985; Piantadosi, 1997).

This paper only considers those ES which randomly assign patients to one of two groups, treatment group or comparison group, and where the outcome of interest is a continuous variable. Under this design, the outcome variable is measured at two times; one time before the intervention (baseline data) and the other time after the intervention (follow-up data).

In these ES the study question of interest is whether there is a significant difference effect between treatment and comparison groups, after intervention. Also, one wants to estimate the difference effect between groups.

Under this study design, there are different ways to analyze the question of interest. For instance, one can use nonparametric analysis or parametric analysis. In addition, one must decide whether or not to use the baseline data in the analysis.

This manuscript reviews some of the strategies that can be used for analyzing data from an ES, and considers using or not using the baseline measurements. The Material and Method section presents the parametric and nonparametric methods used in the analysis of data from an ES, and the Discussion gives some guidelines for using these procedures.

**MATERIAL AND METHOD**

**Parametric methods**. In an experimental study, patients are randomly assigned to the comparison group or the treatment group before the intervention is applied. It is assumed that a continuous outcome variable of interest is measured on each participant at two times, before and after the intervention. Suppose that the baseline (before intervention) measurements of the outcome variable for the control group and the treatment group are $Y_{ctrl,b,i}$, $i = 1, \ldots, n_{ctrl}$, and $Y_{treat,b,i}$, $i = 1, \ldots, n_{treat}$, respectively. Similarly, let the corresponding follow-up (after intervention) measurements for the control group and the treatment group be $Y_{ctrl,f,i}$, $i = 1, \ldots, n_{ctrl}$, and $Y_{treat,f,i}$, $i = 1, \ldots, n_{treat}$, respectively. Suppose that the variance-covariance matrix of the baseline and follow-up measurements is identical in the two groups and equal to:

$$\Sigma = \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

where $\sigma^2 = \mathrm{Var}(Y_{ctrl,b,i}) = \mathrm{Var}(Y_{treat,b,i}) = \mathrm{Var}(Y_{ctrl,f,i}) = \mathrm{Var}(Y_{treat,f,i})$ and $\rho = \mathrm{Corr}(Y_{ctrl,b,i}, Y_{ctrl,f,i}) = \mathrm{Corr}(Y_{treat,b,i}, Y_{treat,f,i})$. This assumption means that the variance of the continuous outcome is the same at baseline and at follow-up, $\sigma^2$, and that the correlation between the baseline and follow-up outcomes of a participant is $\rho$.

**Simple model**. This strategy uses only the follow-up measurement on each participant (no baseline data). It can be represented by $Y_f = \alpha_0 + \alpha_1 T + e$, where $Y_f$ is the outcome at follow-up, $\alpha_0$ is the intercept term, $\alpha_1$ is the difference in effect between the treatment group and the control group, $T$ is a group indicator variable ($T = 1$ for treatment, $T = 0$ for comparison), and $e$ is the error term, which follows a normal distribution with mean 0 and variance $\sigma^2$. In this model, the unbiased estimate of $\alpha_1$ is given by:

$$\hat{\alpha}_1 = \bar{Y}_{treat,f} - \bar{Y}_{ctrl,f}$$

where $\bar{Y}_{treat,f}$ and $\bar{Y}_{ctrl,f}$ are the average values of the outcome at follow-up in the treatment group and control group, respectively. This estimator is simply the difference of the two follow-up means, without controlling for the baseline measurements. It is assumed that the randomization produces balance of the outcome between the groups. The variance of the estimate of $\alpha_1$ is:

$$\mathrm{Var}(\hat{\alpha}_1) = \sigma^2 \left( \frac{1}{n_{treat}} + \frac{1}{n_{ctrl}} \right)$$

and if the sample sizes of the treatment and control groups are equal ($n_{treat} = n_{ctrl} = n$) we obtain the known expression:

$$\mathrm{Var}(\hat{\alpha}_1) = \frac{2\sigma^2}{n}$$

Most standard analyses of ES assume that randomization takes care of baseline differences, and thus that this Simple model approach is appropriate.
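As a rough illustration of the Simple model estimator, the following Python sketch computes the difference of follow-up means and a standard error based on the pooled sample variance. The function name and the data values are hypothetical and are not taken from the study.

```python
import math
import statistics


def simple_model_estimate(follow_treat, follow_ctrl):
    """Difference of follow-up means and its standard error (pooled variance)."""
    n_t, n_c = len(follow_treat), len(follow_ctrl)
    diff = statistics.fmean(follow_treat) - statistics.fmean(follow_ctrl)
    # The pooled sample variance estimates the common error variance sigma^2.
    pooled_var = (
        (n_t - 1) * statistics.variance(follow_treat)
        + (n_c - 1) * statistics.variance(follow_ctrl)
    ) / (n_t + n_c - 2)
    # Var(estimate) = sigma^2 * (1/n_treat + 1/n_ctrl)
    std_error = math.sqrt(pooled_var * (1 / n_t + 1 / n_c))
    return diff, std_error


# Hypothetical follow-up values, for illustration only.
treat = [170.0, 162.0, 175.0, 168.0]
ctrl = [165.0, 160.0, 171.0, 164.0]
estimate, std_error = simple_model_estimate(treat, ctrl)
print(f"estimate = {estimate:.2f}, SE = {std_error:.2f}")
```

The ratio of the estimate to its standard error is the usual two-sample t statistic with $n_{treat} + n_{ctrl} - 2$ degrees of freedom.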

**Difference Score model**. This method uses, for each participant, the difference between the follow-up and baseline measurements. The following model is used: $Y_f - Y_b = \beta_0 + \beta_1 T + e$, where $(Y_f - Y_b)$ is the difference between the outcome variable at follow-up and at baseline, $\beta_0$ is the intercept term, $\beta_1$ is the difference in effect between the treatment group and the control group, and $T$ and $e$ are as before. The estimated difference of treatment effect is given by:

$$\hat{\beta}_1 = \bar{Z}_{treat} - \bar{Z}_{ctrl}$$

where $\bar{Z}_{treat} = \bar{Y}_{treat,f} - \bar{Y}_{treat,b}$, $\bar{Z}_{ctrl} = \bar{Y}_{ctrl,f} - \bar{Y}_{ctrl,b}$, $\bar{Y}_{ctrl,f}$ and $\bar{Y}_{treat,f}$ are the sample means of the outcome variable at follow-up for the control group and the treatment group, and $\bar{Y}_{ctrl,b}$ and $\bar{Y}_{treat,b}$ are the sample means at baseline for the control group and the treatment group, respectively. This estimator takes into account a potential imbalance in the outcome left by the randomization process. Note that $\hat{\beta}_1$ compares the means $\bar{Z}_{treat}$ and $\bar{Z}_{ctrl}$, that is, the follow-up means adjusted by the averages at baseline.

The variance of the estimate of $\beta_1$ is given by:

$$\mathrm{Var}(\hat{\beta}_1) = 2\sigma^2 (1 - \rho) \left( \frac{1}{n_{treat}} + \frac{1}{n_{ctrl}} \right)$$

and if the sample sizes are equal, then:

$$\mathrm{Var}(\hat{\beta}_1) = \frac{4\sigma^2 (1 - \rho)}{n}$$

where $\rho$ is the within-subject correlation between baseline and follow-up measurements. In general, this correlation is most likely to be positive, which reduces the variance of $\hat{\beta}_1$.

This approach is often more efficient than using only a single follow-up measurement because the standard error of the effect of difference of treatment and control is usually reduced as the result of using two measurements from each participant.
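The Difference Score estimator can be sketched in Python as the difference between the mean changes from baseline in the two groups. The function name and the data below are hypothetical and serve only to illustrate the calculation.

```python
import statistics


def difference_score_estimate(base_treat, follow_treat, base_ctrl, follow_ctrl):
    """Difference between groups in the mean change from baseline."""
    # Per-participant change: follow-up minus baseline.
    change_treat = [f - b for b, f in zip(base_treat, follow_treat)]
    change_ctrl = [f - b for b, f in zip(base_ctrl, follow_ctrl)]
    return statistics.fmean(change_treat) - statistics.fmean(change_ctrl)


# Hypothetical paired measurements, for illustration only.
base_treat = [172.0, 160.0, 178.0, 166.0]
follow_treat = [170.0, 162.0, 175.0, 168.0]
base_ctrl = [163.0, 158.0, 170.0, 161.0]
follow_ctrl = [165.0, 160.0, 171.0, 164.0]

estimate = difference_score_estimate(base_treat, follow_treat, base_ctrl, follow_ctrl)
print(f"difference score estimate = {estimate:.2f}")
```

Because each participant contributes a single change score, the standard error can be obtained exactly as in a two-sample comparison applied to the change scores.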

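**Analysis of Covariance (ANCOVA) model**. This method employs the baseline measurement as a covariate. One fits the following regression model: $Y_f = \gamma_0 + \gamma_1 T + \gamma_2 Y_b + e$, where $\gamma_0$ is the intercept term, $\gamma_1$ is the effect of the difference between the treatments, $\gamma_2$ is the effect of the baseline measurement, and $T$ and $e$ are as before. This model depends on several assumptions, including normality of the error terms, equality of error variances for the different treatments, equality of slopes for the different treatment regression lines, and linearity of the regression. The unbiased estimate of $\gamma_1$ is given by:

$$\hat{\gamma}_1 = \left( \bar{Y}_{treat,f} - \bar{Y}_{ctrl,f} \right) - \frac{(n_{ctrl} - 1)\, S^2_{Y_b,ctrl}\, \hat{b}_{1,ctrl} + (n_{treat} - 1)\, S^2_{Y_b,treat}\, \hat{b}_{1,treat}}{(n_{ctrl} - 1)\, S^2_{Y_b,ctrl} + (n_{treat} - 1)\, S^2_{Y_b,treat}} \left( \bar{Y}_{treat,b} - \bar{Y}_{ctrl,b} \right)$$

where $\hat{b}_{1,ctrl}$ and $\hat{b}_{1,treat}$ are the estimated slopes from separate line fits for the control group and the treatment group, $S^2_{Y_b,ctrl}$ and $S^2_{Y_b,treat}$ are the sample variances of the outcome variable at baseline for the control group and the treatment group, $\bar{Y}_{ctrl,f}$ and $\bar{Y}_{treat,f}$ are the sample means of the outcome variable at follow-up for the control group and the treatment group, and $\bar{Y}_{ctrl,b}$ and $\bar{Y}_{treat,b}$ are the sample means at baseline for the control group and the treatment group, respectively. The estimator $\hat{\gamma}_1$ is a generalization of the estimators of the Simple model and the Difference Score model: setting the pooled slope to 0 gives the Simple model estimator, and setting it to 1 gives the Difference Score estimator (Kleinbaum et al., 1998).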

Comparison group (T=0):

Treatment group (T=1):

An alternative way to calculate the unbiased estimate of $\gamma_1$ is:

where:

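So, the variance of the estimate of $\gamma_1$ is given by:

$$\mathrm{Var}(\hat{\gamma}_1) = \sigma^2 (1 - \rho^2) \left( \frac{1}{n_{treat}} + \frac{1}{n_{ctrl}} \right)$$

and if $n_{treat} = n_{ctrl} = n$ then

$$\mathrm{Var}(\hat{\gamma}_1) = \frac{2\sigma^2 (1 - \rho^2)}{n}$$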

Thus, the covariance analysis reduces the variance of the treatment effect estimate and thereby provides a more powerful statistical test (and narrower confidence intervals) for examining the difference between groups (Koch et al., 1982).
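A minimal Python sketch of the baseline-adjusted estimate follows. It uses the pooled within-group slope, which is what a least-squares fit of the follow-up outcome on the group indicator and the baseline value yields as the common slope. The function name and the data are hypothetical; the data are constructed without noise so that the true slope (0.5) and group effect (5) are recovered exactly.

```python
import statistics


def ancova_estimate(base_treat, follow_treat, base_ctrl, follow_ctrl):
    """Baseline-adjusted group difference using the pooled within-group slope."""

    def group_sums(x, y):
        mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        return mean_x, mean_y, sxx, sxy

    mb_t, mf_t, sxx_t, sxy_t = group_sums(base_treat, follow_treat)
    mb_c, mf_c, sxx_c, sxy_c = group_sums(base_ctrl, follow_ctrl)
    # Common slope: within-group cross-products pooled over both groups.
    slope = (sxy_t + sxy_c) / (sxx_t + sxx_c)
    # Difference of follow-up means, corrected for the baseline imbalance.
    effect = (mf_t - mf_c) - slope * (mb_t - mb_c)
    return effect, slope


# Hypothetical noise-free data: follow-up = 10 + 5*T + 0.5*baseline.
base_treat = [150.0, 160.0, 170.0, 180.0]
follow_treat = [10 + 5 + 0.5 * b for b in base_treat]
base_ctrl = [155.0, 165.0, 175.0]
follow_ctrl = [10 + 0.5 * b for b in base_ctrl]

effect, slope = ancova_estimate(base_treat, follow_treat, base_ctrl, follow_ctrl)
print(f"adjusted effect = {effect:.2f}, common slope = {slope:.2f}")
```

Replacing the estimated slope by 0 or by 1 in the last step reproduces the Simple model and Difference Score estimators, respectively.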

**Non-parametric methods**. Because the study design is based on randomization, the analysis can be nonparametric. For instance, it is possible to use the Wilcoxon-Mann-Whitney test for the strategies of the Simple model and the Difference Score model above, and the Rank Analysis of Covariance for the ANCOVA model (Wilcoxon, 1945).

The Wilcoxon-Mann-Whitney test tests the null hypothesis that the distribution of an ordinal scale response variable is the same in two independent groups. This statistical test is sensitive to the alternative hypothesis that there is a location difference between the two groups. Also, this statistical test can be used when the t-test is appropriate (Wilcoxon, 1945).

If the sample size is large, the Wilcoxon-Mann-Whitney test converges to the Mantel-Haenszel mean score statistic for the special case of one stratum when rank scores are used. Thus, another way to analyze the data under the Simple model and the Difference Score model is to use the Mantel-Haenszel mean score statistic (Wilcoxon, 1945).
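The one-stratum mean score statistic with rank scores can be sketched in Python as follows. This is an illustrative implementation under the large-sample chi-square approximation with 1 degree of freedom, not the code used in the study; the function names and data are hypothetical.

```python
import math


def midranks(values):
    """Ranks 1..N, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_chi_square(treat, ctrl):
    """Chi-square (1 df) form of the rank-sum test and its p-value."""
    ranks = midranks(list(treat) + list(ctrl))
    n_t, n = len(treat), len(treat) + len(ctrl)
    mean_rank = (n + 1) / 2
    # Finite-population variance of the rank scores (handles ties).
    var_rank = sum((r - mean_rank) ** 2 for r in ranks) / (n - 1)
    rank_sum = sum(ranks[:n_t])
    # Randomization variance of the treatment rank sum.
    var_sum = n_t * (n - n_t) / n * var_rank
    chi_sq = (rank_sum - n_t * mean_rank) ** 2 / var_sum
    p_value = math.erfc(math.sqrt(chi_sq / 2))  # upper tail, chi-square 1 df
    return chi_sq, p_value


# Hypothetical outcome values, for illustration only.
chi_sq, p_value = rank_sum_chi_square([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(f"chi-square = {chi_sq:.3f}, p-value = {p_value:.4f}")
```

For the Difference Score strategy the same function would be applied to the per-participant change scores instead of the follow-up values.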

As mentioned before, the ANCOVA model depends on several assumptions, which one must verify before fitting the model. In situations in which these assumptions are not satisfied, the Rank Analysis of Covariance can be used (Quade, 1982). This technique can be combined with the extended Mantel-Haenszel statistics to establish nonparametric comparisons between treatment groups, after adjusting for the effect of the covariate (Koch et al., 1982, 1990).
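One common way to carry out a rank analysis of covariance is to rank the outcome and the covariate over all participants, regress the outcome ranks on the covariate ranks, and then compare the residuals between groups with a mean score statistic. The Python sketch below follows that recipe under simplifying assumptions: a single covariate, no tied values, and a chi-square approximation with 1 degree of freedom. The function names and data are hypothetical, and this is not claimed to be the exact procedure used in the study.

```python
def ranks_no_ties(values):
    """1-based ranks; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def rank_ancova_chi_square(baseline, follow_up, group):
    """Mean score chi-square (1 df) on residuals of rank(Y) regressed on rank(X).

    group holds 1 for treatment and 0 for control.
    """
    n = len(group)
    mean_rank = (n + 1) / 2
    # Centered ranks of the covariate and of the outcome.
    cx = [r - mean_rank for r in ranks_no_ties(baseline)]
    cy = [r - mean_rank for r in ranks_no_ties(follow_up)]
    slope = sum(a * b for a, b in zip(cx, cy)) / sum(a * a for a in cx)
    residuals = [b - slope * a for a, b in zip(cx, cy)]  # these have mean zero
    n_t = sum(group)
    resid_sum_treat = sum(r for r, g in zip(residuals, group) if g == 1)
    # Randomization variance of the treatment-group residual sum.
    var_sum = n_t * (n - n_t) / n * sum(r * r for r in residuals) / (n - 1)
    return resid_sum_treat ** 2 / var_sum


# Hypothetical data, for illustration only.
baseline = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
follow_up = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
group = [0, 1, 0, 1, 0, 1]
statistic = rank_ancova_chi_square(baseline, follow_up, group)
print(f"rank ANCOVA chi-square = {statistic:.3f}")
```

With several covariates, the regression step would use all covariate ranks as predictors before the residuals are compared.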

The advantages of nonparametric methods include higher statistical power under certain conditions, exact p-values when the sample size is small, and no assumptions about the form of the distribution. However, a disadvantage of these methods is the lack of estimates of the magnitude of treatment effects.

**Relative Efficiency**. The relative efficiency (Reffic) of the three parametric methods above can be calculated. Three comparisons are considered: the Difference Score model relative to the Simple model, ANCOVA relative to the Simple model, and the Difference Score model relative to ANCOVA. The Reffic is defined as the ratio of the variances of the effect estimate under the two methods being compared.

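Let $Y_{bi}$ indicate the baseline measurement and $Y_{fi}$ the follow-up measurement; then the difference score is $d_i = Y_{fi} - Y_{bi}$. Assume normality for $Y_{bi}$ and $Y_{fi}$, with $\mathrm{Var}(Y_{bi}) = \mathrm{Var}(Y_{fi}) = \sigma^2$ and correlation between $Y_{bi}$ and $Y_{fi}$ given by $\rho$. The variance of $d_i$ is given by $\mathrm{Var}(d_i) = 2\sigma^2 (1 - \rho)$, with the restriction of equal variance for $Y_{bi}$ and $Y_{fi}$. So, the efficiency of the Difference Score model relative to the Simple model is equal to:

$$\mathrm{Reffic}(\text{Difference Score}, \text{Simple model}) = \frac{\mathrm{Var}(\text{Difference Score})}{\mathrm{Var}(\text{Simple model})} = \frac{2\sigma^2 (1 - \rho)}{\sigma^2} = 2(1 - \rho)$$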

A correlation between $Y_{bi}$ and $Y_{fi}$ greater than 0.5 makes the Difference Score model more efficient than the Simple model, whereas a correlation less than 0.5 ensures that the Simple model is more efficient than the Difference Score model. At a correlation of exactly 0.5 the two models are equally efficient.

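Under the ANCOVA model, the variance of $Y_{fi}$ given $Y_{bi}$ is:

$$\mathrm{Var}(Y_{fi} \mid Y_{bi}) = \sigma^2 (1 - \rho^2)$$

Then the efficiency of ANCOVA relative to the Simple model is given by:

$$\mathrm{Reffic}(\text{ANCOVA}, \text{Simple model}) = \frac{\mathrm{Var}(\text{ANCOVA})}{\mathrm{Var}(\text{Simple model})} = \frac{\sigma^2 (1 - \rho^2)}{\sigma^2} = 1 - \rho^2$$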

This value is always less than or equal to 1. Therefore, ANCOVA is never less efficient than the Simple model. The efficiency of the Difference Score model relative to ANCOVA is equal to:

$$\mathrm{Reffic}(\text{Difference Score}, \text{ANCOVA}) = \frac{\mathrm{Var}(\text{Difference Score})}{\mathrm{Var}(\text{ANCOVA})} = \frac{2(1 - \rho)}{1 - \rho^2} = \frac{2}{1 + \rho}$$

which is always greater than or equal to 1. So, ANCOVA is never less efficient than the Difference Score model.
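The three variance ratios above depend only on the correlation, so they are easy to tabulate. The short Python sketch below does this for a few values of the correlation; the function names are illustrative.

```python
def reffic_difference_vs_simple(rho):
    """Var(Difference Score) / Var(Simple model) = 2 * (1 - rho)."""
    return 2 * (1 - rho)


def reffic_ancova_vs_simple(rho):
    """Var(ANCOVA) / Var(Simple model) = 1 - rho^2."""
    return 1 - rho ** 2


def reffic_difference_vs_ancova(rho):
    """Var(Difference Score) / Var(ANCOVA) = 2 / (1 + rho)."""
    return 2 / (1 + rho)


for rho in (0.4, 0.5, 0.7):
    print(
        f"rho = {rho}: "
        f"DS/Simple = {reffic_difference_vs_simple(rho):.2f}, "
        f"ANCOVA/Simple = {reffic_ancova_vs_simple(rho):.2f}, "
        f"DS/ANCOVA = {reffic_difference_vs_ancova(rho):.2f}"
    )
```

A ratio below 1 means the first-named method has the smaller variance; a ratio above 1 means the second-named method does.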

**Applications**

Example: An Experimental Study to Compare Two Treatments of Cholesterol.

This example is based on data from an experimental community-based trial to compare the efficacy of a school-based treatment with a placebo group for reducing cholesterol levels in children (Harrell et al., 1996). By randomization, 617 children were assigned to the control group and 546 children were assigned to the treatment group. The primary outcome variable was level of cholesterol for the 1163 children, measured at two times, before intervention and after intervention.

The analysis plan for this experimental study identified 5 covariables at baseline as relevant candidates for adjustment, which were Height (cm), Weight (kg), VO2 Max (aerobic capacity, ml/kg/min), Skinfold Sum (mm), and Systolic BP (mmHg).

Table I describes the characteristics of the children at baseline. From this table, one can see the imbalance in the distribution of baseline Cholesterol values for the two groups. The average cholesterol level was 164.9 mg/dl in the control group and 168.2 mg/dl in the treatment group. The distributions of the covariables Height, Weight, VO2 Max, Skinfold Sum, and Systolic BP do not vary much between the treatment group and the control group at baseline (Table I). The statistical analysis for this ES considers the nonparametric and parametric methods described in the Material and Method section.

Table I. Baseline Characteristics of the 1163 Children (mean±SD).

**3.1. Parametric Analysis**

Table II shows the results of the analysis for the three models mentioned. For the Simple Model, the effect of the difference between groups is not significant (p-value=0.1168). The estimated effect difference in cholesterol at follow-up of the two groups is 2.73. This analysis is equivalent to using a t-test statistic for two independent samples.

Table II. Parametric Analysis Using Linear Models.

The second model, Difference Score, shows that the difference between the groups is significant (p-value < 0.0001). The estimate of the difference in cholesterol of the groups is 6.05. The adjusted linear model for the difference score is given by:

The ANCOVA model shows that Cholesterol at follow-up depends strongly on the Cholesterol at baseline (p-value < 0.0001) and the difference between groups, after accounting for the baseline Cholesterol level, is significant (p-value < 0.0001). The two adjusted regression lines are:

Comparison group:

and Treatment group:

respectively; the estimate of the difference between treatments after accounting for the baseline Cholesterol level is 5.30.

From the Difference Score model and the ANCOVA model, one can note that there is no difference in the conclusions for this analysis, because the p-values associated with the difference in the groups are almost the same. However, the value of the statistic associated with the difference in groups for the Difference Score model is greater than the value associated with the ANCOVA model, but the standard error of the estimate of difference between the groups for the Difference Score model is less than the standard error of the estimate of the difference for the ANCOVA model (Table II).

The analysis also considered adjusting simultaneously for the covariables Height, Weight, VO2 Max, Skinfold Sum, and Systolic BP at baseline. Table IV and Table V show the results of this analysis. All of the models are ANCOVA models; however, they keep the same names as given above (Simple model, Difference Score model, ANCOVA model) in order to facilitate comparisons between the models that use the additional covariables and those that do not. The Simple model shows no difference between the groups (p-value = 0.1031). The Difference Score model shows a significant difference between the groups (p-value = 0.0001). The ANCOVA model shows that the effect of the difference between groups is significant (p-value = 0.0001).

The conclusions related to the difference between groups obtained from those models with the 5 additional covariables (Height, Weight, VO2 Max, Skinfold Sum, and Systolic BP) at baseline are the same as those without the additional covariables (see Tables IV and V). Also, the estimates of the difference effects between groups are almost the same for those models with the 5 additional covariables and those without the additional covariables. In other words, it is not necessary to adjust for the 5 covariables in the three models used, because the distribution of these 5 covariables at baseline is balanced between the two groups.

Table III. Wilcoxon-Mann-Whitney Test Using the Mantel-Haenszel Score Chi-Square Statistic.

Table IV. Rank Analysis of Covariance combined with Mantel-Haenszel Statistic.

Table V. Parametric Analysis Using Additional Covariables at Baseline.

**3.2. Non-parametric Analysis**. The Wilcoxon-Mann-Whitney test was used to analyze the Simple model (no baseline data) and the Difference Score model. This test converges to the Mantel-Haenszel mean score statistic for the special case of one stratum when rank scores are used. In this example, the sample size is large enough that the Mantel-Haenszel statistic is appropriate for providing confirmatory inferences about the treatment group being better than the control group. PROC FREQ in SAS was employed to calculate the Mantel-Haenszel statistic.

For the Simple model, Table III shows the Mantel-Haenszel mean score statistic, which indicates that there is no significant difference between the treatment group and the control group (chi-square = 2.401, 1 df, p-value = 0.121). This table also shows that for the Difference Score model, the Mantel-Haenszel mean score statistic is equal to 26.359 with 1 df, corresponding to a p-value of 0.001. Therefore, the change in cholesterol level differs between the treatment group and the control group.

The ANCOVA analysis was carried out using Rank Analysis of Covariance combined with the extended Mantel-Haenszel statistics. This methodology can be implemented in the SAS System, by using PROC RANK, PROC REG, and PROC FREQ.

The results of the ANCOVA analysis with cholesterol at baseline as the only covariable are given in Table IV. The Mantel-Haenszel statistic is equal to 15.79, with 1 df and a p-value of 0.001, which indicates a clearly significant difference between the treatment group and the control group after accounting for the cholesterol level at baseline.

In the ANCOVA analysis with cholesterol, Height, Weight, VO2 Max, Skinfold Sum, and Systolic BP at baseline as the covariables, the Mantel-Haenszel statistic was equal to 18.49, with 1 df and a corresponding p-value of 0.001. Thus, there is a significant difference between the treatment group and the control group after adjusting for the additional baseline covariables. The Mantel-Haenszel statistic from this model is just slightly larger than that from the ANCOVA model with cholesterol as the only covariable. One of the reasons for this is the balance in the baseline distribution of these additional covariables between the two groups (see Table I).
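The p-values quoted for these 1-df statistics can be checked with the chi-square upper-tail probability, which for 1 degree of freedom equals erfc(sqrt(x/2)). The Python sketch below applies this to the four statistics reported above; the function name and the labels are illustrative.

```python
import math


def chi_square_1df_p_value(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Statistics as reported in the text (1 df each).
reported = {
    "Simple model": 2.401,
    "Difference Score model": 26.359,
    "Rank ANCOVA, cholesterol only": 15.79,
    "Rank ANCOVA, six covariables": 18.49,
}
for label, statistic in reported.items():
    print(f"{label}: chi-square = {statistic}, p = {chi_square_1df_p_value(statistic):.2e}")
```

Under this approximation the first statistic gives a p-value that agrees with the reported 0.121 to three decimals, and the other three give p-values below 0.001.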

The results from nonparametric methods agree with the results obtained using parametric methods.

**Simulations**

**Simulation Process**. This section considers the estimation of the difference effect between groups for the three parametric models described in the Material and Method section, using a simulation process. The data used for fitting the three models (Simple model, Difference Score model, and ANCOVA model) are generated in the following manner:

Where Z_{1i} and Z_{3i} are independent N (0, 1).

The random variables Z_{1i} and Z_{3i} were created using the function RANNOR in SAS.

B) By using randomization, half of the n data generated at baseline is assigned to the control group and the other half is assigned to the treatment group. PROC PLAN in SAS was used to assign at random.

C) The outcome variable at follow-up, Y_{fi}, is generated in the following manner:

i) If the participant belongs to the control group, then

Where Z_{2i} and Z_{3i} are independent N (0, 1).

The random variable Z_{2i} was created using the function RANNOR in SAS, and Z_{3i} has values as before.

where $Z_{2i}$ and $Z_{3i}$ are as before, and $\beta$ represents the difference effect between the two groups.

Note that the outcome variables at baseline and follow-up, as generated above, have the same variance $\sigma^2$, and the correlation between these two variables is equal to $\rho$.
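One construction that satisfies the properties just stated (common variance, correlation between baseline and follow-up induced by a shared standard normal component, and a shift for the treatment group) is sketched below in Python. The weights on the normal components, the `mean` parameter, and the function name are assumptions made for this illustration; they are not taken from the paper, which generated its data in SAS.

```python
import math
import random


def simulate_trial(n, sigma, rho, beta, mean=0.0, seed=None):
    """Return a list of (group, baseline, follow_up) tuples for n participants.

    Assumed construction: each outcome is mean + sigma * (sqrt(1 - rho) * Z_own
    + sqrt(rho) * Z_shared), which gives variance sigma^2 and correlation rho
    (requires 0 <= rho <= 1). Treated participants get beta added at follow-up.
    """
    rng = random.Random(seed)
    own_weight, shared_weight = math.sqrt(1 - rho), math.sqrt(rho)
    # Randomly assign half of the participants to treatment (1), half to control (0).
    groups = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(groups)
    data = []
    for group in groups:
        z1, z2, z3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        baseline = mean + sigma * (own_weight * z1 + shared_weight * z3)
        follow_up = mean + sigma * (own_weight * z2 + shared_weight * z3) + beta * group
        data.append((group, baseline, follow_up))
    return data


# Example call with one of the parameter combinations considered in the text.
trial = simulate_trial(n=100, sigma=10, rho=0.4, beta=5, seed=2013)
print(trial[:3])
```

Each simulated data set can then be passed to the three parametric analyses to compare their estimates and p-values.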

The estimation of the difference effect between groups can be affected by different conditions, such as:

a) The standard deviation of the outcome variables at baseline and follow-up, which is assumed to be the same. Two different values will be used: $\sigma = 10$ and $30$.

b) The correlation, $\rho$, between the outcome variables at baseline and follow-up. The simulation will use the following values: $\rho = 0.4$ and $0.7$.

c) The difference effect between groups, $\beta$. Three different values will be used: $\beta = 2$, $5$, and $10$.

d) The sample size, $n$. Two different sample sizes will be used: $n = 100$ and $1000$.

Thus, combining all the possible values for $\sigma$, $\rho$, $\beta$ and $n$, there are 24 different scenarios that one can analyze for comparing the three models of interest. For each of these 24 scenarios, three hundred simulations will be performed.
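The scenario grid can be enumerated directly, which also confirms the count of 2 × 2 × 3 × 2 = 24 combinations. The variable names in this Python sketch are illustrative.

```python
from itertools import product

sigmas = (10, 30)           # standard deviation at baseline and follow-up
rhos = (0.4, 0.7)           # correlation between baseline and follow-up
betas = (2, 5, 10)          # difference effect between groups
sample_sizes = (100, 1000)  # total sample size n
replications = 300          # simulations per scenario

scenarios = list(product(sigmas, rhos, betas, sample_sizes))
print(len(scenarios), "scenarios,", len(scenarios) * replications, "simulated data sets")
```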

**RESULTS**

This section presents the results of the simulation, which are tabulated for the 24 scenarios mentioned before. The Appendix contains the tables that present the estimates of the difference effect between groups, the statistical tests, and the p-values for the three models fitted.

In general, the results for the three models show that the estimates of the difference effect between groups are unbiased. Also, when the correlation between the outcome variable at baseline and follow-up is small ($\rho = 0.4$), the estimated variance of the estimated difference effect between groups under the Simple model is less than that under the Difference Score model. However, when this correlation is high ($\rho = 0.7$), the estimated variance under the Simple model is greater than that under the Difference Score model. The results also show that the estimated variance of the estimated difference effect is smallest under the ANCOVA model.

For a small value of the correlation between the outcome variable at baseline and follow-up, $\rho = 0.4$, the average p-value associated with the difference effect between groups under the Simple model is less than that under the Difference Score model. The average p-value under the ANCOVA model is the lowest, but it does not differ much from the p-value under the Simple model. Thus, when $\rho$ is small one can fit the Simple model or the ANCOVA model and reach the same conclusion with respect to the difference between groups. When the value of $\rho$ is high, the average p-values under the Difference Score model are similar to those under the ANCOVA model, so one can fit either of these two models. Therefore, if $\rho$ is small, the p-value obtained under the Difference Score model is the most conservative, whereas if $\rho$ is large, the p-value under the Simple model is the most conservative.

One can see that the p-values (for the three models) associated with the difference of groups effect when the Standard Deviation is small, SD=10, are less than when the Standard Deviation is high, SD=30.

For a small sample size, $n = 100$, and a small value of $\beta$, $\beta = 2$, there are no significant p-values (< 0.05), i.e. no significant difference effect between groups. For a small sample size and a medium value of $\beta$, $\beta = 5$, there are significant p-values under the Difference Score model and the ANCOVA model only when the Standard Deviation is small and the correlation between the outcome variable at baseline and follow-up is large. When the sample size is small and $\beta$ is large, there are significant p-values for the three models only when the Standard Deviation is small.

For a large sample size, $n = 1000$, and a small value of $\beta$, there are significant p-values under the three models only when the Standard Deviation is small. For a large sample size and a medium value of $\beta$, there are significant p-values under the three models when the Standard Deviation is small, and also under the Difference Score model and the ANCOVA model when the Standard Deviation and $\rho$ are large. If the sample size and $\beta$ are both large, then all of the p-values are significant.

**APPENDIX**

This appendix contains the tables that present the estimation of the difference effect between groups, the statistical tests, and the p-values for the three parametric models fitted using simulation.


**DISCUSSION**

Under the three parametric models one can get an unbiased estimator for the difference effect between groups. When the correlation of the outcome variable at baseline and follow-up is less than 0.5, the variance of the estimator of the difference effect under the Simple model is less than that under the Difference score model. But if the correlation is bigger than 0.5, the variance of the estimator under the Difference Score model is less than that under the Simple model. However, in general the variance of the estimator of the difference effect under the ANCOVA model is the smallest.

The Simple model should be used under the belief that randomization produces balance in the outcome variable at baseline.

The Difference Score model should be used when randomization produces an imbalance in the outcome variable for treatment and control group, especially in smaller studies or also when a "change" variable is the outcome of interest.

The ANCOVA model is recommended when randomization produces an imbalance in the baseline value of the outcome variable for treatment and control group, especially in smaller studies.

From the simulation approach, if the correlation of the outcome variable at baseline and follow-up is small, one can fit either the Simple model or the ANCOVA model. Also, if this correlation is large one can fit either the Difference Score model or the ANCOVA model.

The overall recommendation of this paper is the complementary use of nonparametric methods and parametric methods for analyzing the data coming from an experimental study.

**REFERENCES**

Friedman, L.; Furberg, C. & DeMets, D. Fundamentals of Clinical Trials. 2nd ed. St Louis, Mosby-Year Book, 1985.

Harrell, J. S.; McMurray, R. G.; Bangdiwala, S. I.; Frauman, A. C.; Gansky, S. A. & Bradley, C. B. Effects of a school-based intervention to reduce cardiovascular disease risk factors in elementary-school children: The Cardiovascular Health in Children (CHIC) Study. J. Pediatr., 128(6):797-805, 1996.

Kleinbaum, D.; Kupper, L.; Muller, K. & Nizam, A. Applied Regression Analysis and Multivariable Methods. 3rd ed. Pacific Grove, Brooks/Cole Publishing Company, 1998.

Koch, G. G.; Amara, I. A.; Davis, G. W. & Gillings, D. B. A review of some statistical methods for covariance analysis of categorical data. Biometrics, 38(3):563-95, 1982.

Koch, G.; Carr, G.; Amara, I.; Stokes, M. & Uryniak, T. Categorical data analysis in Statistical Methodology in the Pharmaceutical Sciences. New York, Marcel Dekker Inc., 1990. pp.391-475.

Piantadosi, S. Clinical Trials: A Methodologic Perspective. 2nd ed. New York, John Wiley & Sons Inc., 1997.

Quade, D. Nonparametric analysis of covariance by matching. Biometrics, 38(3):597-611, 1982.

Wilcoxon, F. Individual comparison by ranking methods. Biometrics, 1(6):80-3, 1945.

**Dr. Antonio Sanhueza**

Department of Mathematics and Statistics
Universidad de La Frontera

Temuco

CHILE

Email: antonio.sanhueza@ufrontera.cl

Received: 29-06-2013

Accepted: 19-12-2013