In this section we discuss correlation analysis, which is a technique used to quantify the association between two continuous variables. For example, we might want to quantify the association between body mass index and systolic blood pressure, or between hours of exercise per week and percent body fat. Regression analysis is a related technique to assess the relationship between an outcome variable and one or more risk factors or confounding variables (confounding is discussed later). The outcome variable is also called the **response** or **dependent variable**, and the risk factors and confounders are called the **predictors**, or **explanatory** or **independent variables**. In regression analysis, the dependent variable is denoted "Y" and the independent variables are denoted by "X".


**NOTE:** The term "predictor" can be misleading if it is interpreted as the ability to predict even beyond the limits of the data. Also, the term "explanatory variable" might give an impression of a causal effect in a situation in which inferences should be limited to identifying associations. The terms "independent" and "dependent" variable are less subject to these interpretations, as they do not strongly imply cause and effect.


In correlation analysis, we estimate a **sample correlation coefficient**, more specifically the **Pearson Product Moment correlation coefficient**. The sample correlation coefficient, denoted r, ranges between -1 and +1 and quantifies the direction and strength of the linear association between the two variables. The correlation between two variables can be positive (i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e., higher levels of one variable are associated with lower levels of the other).

The sign of the correlation coefficient indicates the direction of the association. The magnitude of the correlation coefficient indicates the strength of the association.

For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation close to zero suggests no linear association between two continuous variables.

It is important to note that there may be a non-linear association between two continuous variables, but computation of a correlation coefficient does not detect this. Therefore, it is always important to evaluate the data carefully before computing a correlation coefficient. Graphical displays are particularly useful to explore associations between variables.
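
The point about non-linear associations can be illustrated with a minimal sketch. The data below are hypothetical and chosen only for illustration: one outcome follows a perfect curve (y = x²) and the other a perfect line. The helper name `pearson_r` is an assumption of this sketch, not part of the module.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations (the n - 1 terms cancel in r).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: x values symmetric around zero.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys_curved = [x ** 2 for x in xs]      # perfectly related to x, but not linearly
ys_linear = [2 * x + 1 for x in xs]   # perfectly linear in x

r_curved = pearson_r(xs, ys_curved)
r_linear = pearson_r(xs, ys_linear)
print(f"curved: r = {r_curved:.2f}, linear: r = {r_linear:.2f}")
```

In this sketch the curved relationship yields a correlation of essentially zero even though y is completely determined by x, which is why a scatter plot should be inspected before r is interpreted.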

The figure below shows four hypothetical scenarios in which one continuous variable is plotted along the X-axis and the other along the Y-axis.

- Scenario 1 depicts a strong positive association (r=0.9), similar to what we might see for the correlation between infant birth weight and birth length.
- Scenario 2 depicts a weaker association (r=0.2) that we might expect to see between age and body mass index (which tends to increase with age).
- Scenario 3 might depict the lack of association (r approximately = 0) between the extent of media exposure in adolescence and age at which adolescents initiate sexual activity.
- Scenario 4 might depict the strong negative association (r= -0.9) generally observed between the number of hours of aerobic exercise per week and percent body fat.

## Example - Correlation of Gestational Age and Birth Weight

A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.

| Infant ID # | Gestational Age (weeks) | Birth Weight (grams) |
|---|---|---|
| 1 | 34.7 | 1895 |
| 2 | 36.0 | 2030 |
| 3 | 29.3 | 1440 |
| 4 | 40.1 | 2835 |
| 5 | 35.7 | 3090 |
| 6 | 42.4 | 3827 |
| 7 | 40.3 | 3260 |
| 8 | 37.3 | 2690 |
| 9 | 40.9 | 3285 |
| 10 | 38.3 | 2920 |
| 11 | 38.5 | 3430 |
| 12 | 41.4 | 3657 |
| 13 | 39.7 | 3685 |
| 14 | 39.7 | 3345 |
| 15 | 41.1 | 3260 |
| 16 | 38.0 | 2680 |
| 17 | 38.7 | 2005 |

We wish to estimate the association between gestational age and infant birth weight. In this example, birth weight is the dependent variable and gestational age is the independent variable. Thus y=birth weight and x=gestational age. The data are displayed in a scatter diagram in the figure below.

Each point represents an (x,y) pair (in this case the gestational age, measured in weeks, and the birth weight, measured in grams). Note that the independent variable (gestational age) is on the horizontal axis (or X-axis), and the dependent variable (birth weight) is on the vertical axis (or Y-axis). The scatter plot shows a positive or direct association between gestational age and birth weight. Infants with shorter gestational ages are more likely to be born with lower weights and infants with longer gestational ages are more likely to be born with higher weights.

### Computing the Correlation Coefficient

The formula for the sample correlation coefficient is:

$$r = \frac{Cov(x,y)}{\sqrt{s_x^2 \, s_y^2}}$$

where Cov(x,y) is the covariance of x and y, defined as

$$Cov(x,y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$$

and $s_x^2$ and $s_y^2$ are the sample variances of x and y, defined as follows:

$$s_x^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \quad \text{and} \quad s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$

The variances of x and y measure the variability of the x scores and y scores around their respective sample means, $\bar{x}$ and $\bar{y}$, considered separately. The covariance measures the variability of the (x,y) pairs around the mean of x and mean of y, considered simultaneously.

To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight, and also the covariance of gestational age and birth weight.

We first summarize the gestational age data. The mean gestational age is:

$$\bar{x} = \frac{\sum x}{n} = \frac{652.1}{17} = 38.4$$

To compute the variance of gestational age, we need to sum the squared deviations (or differences) between each observed gestational age and the mean gestational age. The computations are summarized below.

| Infant ID # | Gestational Age (weeks) | $(x - \bar{x})$ | $(x - \bar{x})^2$ |
|---|---|---|---|
| 1 | 34.7 | -3.7 | 13.69 |
| 2 | 36.0 | -2.4 | 5.76 |
| 3 | 29.3 | -9.1 | 82.81 |
| 4 | 40.1 | 1.7 | 2.89 |
| 5 | 35.7 | -2.7 | 7.29 |
| 6 | 42.4 | 4.0 | 16.0 |
| 7 | 40.3 | 1.9 | 3.61 |
| 8 | 37.3 | -1.1 | 1.21 |
| 9 | 40.9 | 2.5 | 6.25 |
| 10 | 38.3 | -0.1 | 0.01 |
| 11 | 38.5 | 0.1 | 0.01 |
| 12 | 41.4 | 3.0 | 9.0 |
| 13 | 39.7 | 1.3 | 1.69 |
| 14 | 39.7 | 1.3 | 1.69 |
| 15 | 41.1 | 2.7 | 7.29 |
| 16 | 38.0 | -0.4 | 0.16 |
| 17 | 38.7 | 0.3 | 0.09 |

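The variance of gestational age is the sum of the squared deviations in the table above (159.45) divided by n - 1:

$$s_x^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{159.45}{16} = 10.0$$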

Next, we summarize the birth weight data. The mean birth weight is:

$$\bar{y} = \frac{\sum y}{n} = \frac{49{,}334}{17} = 2902$$

The variance of birth weight is computed just as we did for gestational age, as shown in the table below.

| Infant ID # | Birth Weight (grams) | $(y - \bar{y})$ | $(y - \bar{y})^2$ |
|---|---|---|---|
| 1 | 1895 | -1007 | 1,014,049 |
| 2 | 2030 | -872 | 760,384 |
| 3 | 1440 | -1462 | 2,137,444 |
| 4 | 2835 | -67 | 4,489 |
| 5 | 3090 | 188 | 35,344 |
| 6 | 3827 | 925 | 855,625 |
| 7 | 3260 | 358 | 128,164 |
| 8 | 2690 | -212 | 44,944 |
| 9 | 3285 | 383 | 146,689 |
| 10 | 2920 | 18 | 324 |
| 11 | 3430 | 528 | 278,784 |
| 12 | 3657 | 755 | 570,025 |
| 13 | 3685 | 783 | 613,089 |
| 14 | 3345 | 443 | 196,249 |
| 15 | 3260 | 358 | 128,164 |
| 16 | 2680 | -222 | 49,284 |
| 17 | 2005 | -897 | 804,609 |

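The variance of birth weight is the sum of the squared deviations (7,767,660) divided by n - 1:

$$s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1} = \frac{7{,}767{,}660}{16} = 485{,}478.8$$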

Next we compute the covariance. To compute the covariance of gestational age and birth weight, we need to multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant, sum the products, and divide by n - 1, that is:

$$Cov(x,y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$$

The computations are summarized below. Notice that we simply copy the deviations from the mean gestational age and birth weight from the two tables above into the table below and multiply.

| Infant ID # | $(x - \bar{x})$ | $(y - \bar{y})$ | $(x - \bar{x})(y - \bar{y})$ |
|---|---|---|---|
| 1 | -3.7 | -1007 | 3725.9 |
| 2 | -2.4 | -872 | 2092.8 |
| 3 | -9.1 | -1462 | 13,304.2 |
| 4 | 1.7 | -67 | -113.9 |
| 5 | -2.7 | 188 | -507.6 |
| 6 | 4.0 | 925 | 3700.0 |
| 7 | 1.9 | 358 | 680.2 |
| 8 | -1.1 | -212 | 233.2 |
| 9 | 2.5 | 383 | 957.5 |
| 10 | -0.1 | 18 | -1.8 |
| 11 | 0.1 | 528 | 52.8 |
| 12 | 3.0 | 755 | 2265.0 |
| 13 | 1.3 | 783 | 1017.9 |
| 14 | 1.3 | 443 | 575.9 |
| 15 | 2.7 | 358 | 966.6 |
| 16 | -0.4 | -222 | 88.8 |
| 17 | 0.3 | -897 | -269.1 |
| Total | | | 28,768.4 |

The covariance of gestational age and birth weight is:

$$Cov(x,y) = \frac{28{,}768.4}{16} = 1798.0$$

Finally, we can now compute the sample correlation coefficient:

$$r = \frac{Cov(x,y)}{\sqrt{s_x^2 \, s_y^2}} = \frac{1798.0}{\sqrt{10.0 \times 485{,}478.8}} = 0.82$$

Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.
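
The hand computation above can be checked with a short script. This is a minimal sketch using only the 17 (gestational age, birth weight) pairs from the data table; the variable names are assumptions of the sketch. It works with unrounded means, so intermediate values may differ slightly from the rounded figures used in the tables.

```python
from math import sqrt

# Data from the study of 17 infants (gestational age in weeks, birth weight in grams).
gest_age = [34.7, 36.0, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9,
            38.3, 38.5, 41.4, 39.7, 39.7, 41.1, 38.0, 38.7]
birth_wt = [1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285,
            2920, 3430, 3657, 3685, 3345, 3260, 2680, 2005]

n = len(gest_age)
mean_x = sum(gest_age) / n
mean_y = sum(birth_wt) / n

# Sample variances and covariance, each dividing by n - 1.
var_x = sum((x - mean_x) ** 2 for x in gest_age) / (n - 1)
var_y = sum((y - mean_y) ** 2 for y in birth_wt) / (n - 1)
cov_xy = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(gest_age, birth_wt)) / (n - 1)

# Sample correlation coefficient.
r = cov_xy / sqrt(var_x * var_y)
print(f"mean age = {mean_x:.1f}, mean weight = {mean_y:.0f}, r = {r:.2f}")
```

Rounded to two decimals, the script should agree with the correlation of 0.82 obtained by hand.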

As we noted, sample correlation coefficients range from -1 to +1. In practice, meaningful correlations (i.e., correlations that are clinically or practically important) can be as small as 0.4 (or -0.4) for positive (or negative) associations. There are also statistical tests to determine whether an observed correlation is statistically significant or not (i.e., statistically significantly different from zero). Procedures to test whether an observed sample correlation is suggestive of a statistically significant correlation are described in detail in Kleinbaum, Kupper and Muller.¹

## Regression Analysis

Regression analysis is a widely used technique which is useful for many applications. We introduce the technique here and expand on its uses in subsequent modules.

## Simple Linear Regression

Simple linear regression is a technique that is appropriate to understand the association between one independent (or predictor) variable and one continuous dependent (or outcome) variable. For example, suppose we want to assess the association between total cholesterol (in milligrams per deciliter, mg/dL) and body mass index (BMI, measured as the ratio of weight in kilograms to height in meters²) where total cholesterol is the dependent variable, and BMI is the independent variable. In regression analysis, the dependent variable is denoted Y and the independent variable is denoted X. So, in this case, Y=total cholesterol and X=BMI.

When there is a single continuous dependent variable and a single independent variable, the analysis is called a simple linear regression analysis. This analysis assumes that there is a linear association between the two variables. (If a different relationship is hypothesized, such as a curvilinear or exponential relationship, alternative regression analyses are performed.)

The figure below is a scatter diagram illustrating the relationship between BMI and total cholesterol. Each point represents the observed (x, y) pair, in this case, BMI and the corresponding total cholesterol measured in each participant. Note that the independent variable (BMI) is on the horizontal axis and the dependent variable (Total Serum Cholesterol) on the vertical axis.

**BMI and Total Cholesterol**

The graph shows that there is a positive or direct association between BMI and total cholesterol; participants with lower BMI are more likely to have lower total cholesterol levels and participants with higher BMI are more likely to have higher total cholesterol levels.

In contrast, the graph below depicts the relationship between BMI and **HDL cholesterol** in the same sample of n=20 participants.

**BMI and HDL Cholesterol**

This graph shows a negative or inverse association between BMI and HDL cholesterol, i.e., those with lower BMI are more likely to have higher HDL cholesterol levels and those with higher BMI are more likely to have lower HDL cholesterol levels.

For either of these relationships we could use simple linear regression analysis to estimate the equation of the line that best describes the association between the independent variable and the dependent variable. The simple linear regression equation is as follows:

$$\hat{Y} = b_0 + b_1 X$$

where **Ŷ** is the predicted or expected value of the outcome, **X** is the predictor, **b0** is the estimated Y-intercept, and **b1** is the estimated slope. The Y-intercept and slope are estimated from the sample data, and they are the values that minimize the sum of the squared differences between the observed and the predicted values of the outcome, i.e., the estimates minimize:

$$\sum (Y - \hat{Y})^2$$

These differences between observed and predicted values of the outcome are called **residuals**. The estimates of the Y-intercept and slope minimize the sum of the squared residuals, and are called the **least squares estimates**.¹

**Residuals**

Conceptually, if the values of X provided a perfect prediction of Y then the sum of the squared differences between observed and predicted values of Y would be 0. That would mean that variability in Y could be completely explained by differences in X. However, if the differences between observed and predicted values are not 0, then we are unable to entirely account for differences in Y based on X, and there are residual errors in the prediction. The residual error could result from inaccurate measurements of X or Y, or there could be other variables besides X that affect the value of Y.
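
The least squares idea can be sketched in a few lines. The data here are hypothetical, and the helper names `fit_line` and `sum_squared_residuals` are assumptions of this sketch. The slope is computed as the sum of cross-products divided by the sum of squared deviations of x, a standard closed form that is algebraically the same as r multiplied by the ratio of the standard deviations of y and x.

```python
def fit_line(xs, ys):
    """Return least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # estimated slope
    b0 = mean_y - b1 * mean_x    # estimated Y-intercept
    return b0, b1

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared differences between observed and predicted y."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)

# Nudging the intercept or slope away from the estimates increases the sum.
nudged = [sum_squared_residuals(xs, ys, b0 + d0, b1 + d1)
          for d0 in (-0.5, 0.0, 0.5)
          for d1 in (-0.2, 0.0, 0.2)
          if (d0, d1) != (0.0, 0.0)]
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, SSR = {best:.3f}")
print("every nudged line is worse:", all(s > best for s in nudged))
```

The comparison against nearby lines is only a spot check, not a proof, but it shows what "least squares" means in practice: no other intercept and slope tried here gives a smaller sum of squared residuals.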

Based on the observed data, the best estimate of a linear relationship will be obtained from an equation for the line that minimizes the differences between observed and predicted values of the outcome. The **Y-intercept** of this line is the value of the dependent variable (Y) when the independent variable (X) is zero. The **slope** of the line is the change in the dependent variable (Y) relative to a one unit change in the independent variable (X). The least squares estimates of the Y-intercept and slope are computed as follows:

$$b_1 = r \frac{s_y}{s_x}$$

and

$$b_0 = \bar{Y} - b_1 \bar{X}$$

where r is the sample correlation coefficient, $\bar{X}$ and $\bar{Y}$ are the sample means, and $s_x$ and $s_y$ are the standard deviations of the independent variable x and the dependent variable y, respectively.

### BMI and Total Cholesterol

The least squares estimates of the regression coefficients, b0 and b1, describing the relationship between BMI and total cholesterol are b0 = 28.07 and b1 = 6.49, computed with the least squares formulas for the slope and Y-intercept.

The estimate of the Y-intercept (b0 = 28.07) represents the estimated total cholesterol level when BMI is zero. Because a BMI of zero is meaningless, the Y-intercept is not informative. The estimate of the slope (b1 = 6.49) represents the change in total cholesterol relative to a one unit change in BMI. For example, if we compare two participants whose BMIs differ by 1 unit, we would expect their total cholesterols to differ by approximately 6.49 units (with the person with the higher BMI having the higher total cholesterol).

The equation of the regression line is as follows:

$$\hat{Y} = 28.07 + 6.49 \, (\text{BMI})$$

The graph below shows the estimated regression line superimposed on the scatter diagram.

The regression equation can be used to estimate a participant's total cholesterol as a function of his/her BMI. For example, suppose a participant has a BMI of 25. We would estimate their total cholesterol to be 28.07 + 6.49(25) = 190.32. The equation can also be used to estimate total cholesterol for other values of BMI. However, the equation should only be used to estimate cholesterol levels for persons whose BMIs are in the range of the data used to generate the regression equation. In our sample, BMI ranges from 20 to 32, thus the equation should only be used to generate estimates of total cholesterol for persons with BMI in that range.
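
The warning about estimating outside the observed range can be built into code. The sketch below uses the coefficients and BMI range stated above; the function name `predict_total_cholesterol` and the choice to raise an error outside the range are assumptions of this illustration.

```python
# Estimated coefficients and observed BMI range, as stated in the text.
B0 = 28.07                    # estimated Y-intercept
B1 = 6.49                     # estimated slope (mg/dL per unit of BMI)
BMI_MIN, BMI_MAX = 20.0, 32.0

def predict_total_cholesterol(bmi):
    """Estimate total cholesterol (mg/dL) from BMI, within the sampled range only."""
    if not BMI_MIN <= bmi <= BMI_MAX:
        # Refuse to extrapolate beyond the data used to fit the line.
        raise ValueError(f"BMI {bmi} is outside the observed range "
                         f"{BMI_MIN} to {BMI_MAX}")
    return B0 + B1 * bmi

estimate = predict_total_cholesterol(25)
print(f"Estimated total cholesterol at BMI 25: {estimate:.2f}")
```

Raising an error is one possible design; returning the estimate with a warning would also be reasonable, depending on how the estimates are used.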

There are statistical tests that can be performed to assess whether the estimated regression coefficients (b0 and b1) are statistically significantly different from zero. The test of most interest is usually H0: β1=0 versus H1: β1≠0, where β1 is the population slope. If the population slope is significantly different from zero, we conclude that there is a statistically significant association between the independent and dependent variables.

### BMI and HDL Cholesterol

The least squares estimates of the regression coefficients, b0 and b1, describing the relationship between BMI and HDL cholesterol are as follows: b0 = 111.77 and b1 = -2.35, computed with the same least squares formulas.

Again, the Y-intercept is uninformative because a BMI of zero is meaningless. The estimate of the slope (b1 = -2.35) represents the change in HDL cholesterol relative to a one unit change in BMI. If we compare two participants whose BMIs differ by 1 unit, we would expect their HDL cholesterols to differ by approximately 2.35 units (with the person with the higher BMI having the lower HDL cholesterol). The figure below shows the regression line superimposed on the scatter diagram for BMI and HDL cholesterol.

Linear regression analysis rests on the assumption that the dependent variable is continuous and that the distribution of the dependent variable (Y) at each value of the independent variable (X) is approximately normally distributed. Note, however, that the independent variable can be continuous (e.g., BMI) or can be dichotomous (see below).

### Comparing Mean HDL Levels With Regression Analysis

Consider a clinical trial to evaluate the efficacy of a new drug to increase HDL cholesterol. We could compare the mean HDL levels between treatment groups statistically using a two independent samples t test. Here we consider an alternate approach. Summary data for the trial are shown below:

| | Sample Size | Mean HDL | Standard Deviation of HDL |
|---|---|---|---|
| New Drug | 50 | 40.16 | 4.46 |
| Placebo | 50 | 39.21 | 3.91 |

HDL cholesterol is the continuous dependent variable and treatment assignment (new drug versus placebo) is the independent variable. Suppose the data on n=100 participants are entered into a statistical computing package. The outcome (Y) is HDL cholesterol in mg/dL and the independent variable (X) is treatment assignment. For this analysis, X is coded as 1 for participants who received the new drug and as 0 for participants who received the placebo. A simple linear regression equation is estimated as follows:

$$\hat{Y} = 39.21 + 0.95 X$$

where Ŷ is the estimated HDL level and X is a dichotomous variable (also called an indicator variable, in this case indicating whether the participant was assigned to the new drug or to placebo). The estimate of the Y-intercept is b0=39.21. The Y-intercept is the value of Y (HDL cholesterol) when X is zero. In this example, X=0 indicates assignment to the placebo group. Thus, the Y-intercept is exactly equal to the mean HDL level in the placebo group. The slope is estimated as b1=0.95. The slope represents the estimated change in Y (HDL cholesterol) relative to a one unit change in X. A one unit change in X represents a difference in treatment assignment (placebo versus new drug). The slope therefore represents the difference in mean HDL levels between the treatment groups. Thus, the mean HDL for participants receiving the new drug is:

$$\hat{Y} = 39.21 + 0.95(1) = 40.16$$
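
The link between a 0/1 indicator variable and the group means can be checked numerically. The trial's individual-level data are not given, so the HDL values below are hypothetical and serve only to illustrate the property; `fit_line` is an assumed helper implementing the usual least squares formulas.

```python
def fit_line(xs, ys):
    """Return least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical HDL values (mg/dL); NOT the trial data.
placebo_hdl = [38.0, 40.0, 39.5, 41.0]
drug_hdl = [40.5, 41.5, 39.0, 42.0]

# Indicator coding: X = 0 for placebo, X = 1 for the new drug.
xs = [0] * len(placebo_hdl) + [1] * len(drug_hdl)
ys = placebo_hdl + drug_hdl

b0, b1 = fit_line(xs, ys)
mean_placebo = sum(placebo_hdl) / len(placebo_hdl)
mean_drug = sum(drug_hdl) / len(drug_hdl)

print(f"b0 = {b0:.3f}  (placebo mean = {mean_placebo:.3f})")
print(f"b1 = {b1:.3f}  (difference in means = {mean_drug - mean_placebo:.3f})")
```

With this coding, the intercept reproduces the placebo mean and the slope reproduces the difference in means, which is the same relationship as 39.21 and 0.95 in the trial summary.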


## The Controversy Over Environmental Tobacco Smoke Exposure

There is convincing evidence that active smoking is a *cause* of lung cancer and heart disease. Many studies done in a wide variety of circumstances have consistently demonstrated a strong association and also indicate that the risk of lung cancer and cardiovascular disease (i.e., heart attacks) increases in a dose-related way. These studies have led to the conclusion that active smoking is causally related to lung cancer and cardiovascular disease. Studies in active smokers have had the advantage that the lifetime exposure to tobacco smoke can be quantified with reasonable accuracy, since the unit dose is consistent (one cigarette) and the habitual nature of tobacco smoking makes it possible for most smokers to provide a reasonable estimate of their total lifetime exposure quantified in terms of cigarettes per day or packs per day. Frequently, average daily exposure (cigarettes or packs) is combined with duration of use in years in order to quantify exposure as "pack-years".

It has been much more difficult to establish whether environmental tobacco smoke (ETS) exposure is causally related to chronic diseases like heart disease and lung cancer, because the total lifetime exposure dosage is lower, and it is much more difficult to accurately estimate total lifetime exposure. In addition, quantifying these risks is complicated by confounding factors. For example, ETS exposure is usually classified based on parental or spousal smoking, but these studies are unable to quantify other environmental exposures to tobacco smoke, and the inability to quantify and adjust for other environmental exposures such as air pollution makes it difficult to demonstrate an association even if one existed. As a result, there continues to be controversy over the risk imposed by environmental tobacco smoke (ETS). Some have gone so far as to claim that even very brief exposure to ETS can cause a myocardial infarction (heart attack), but a very large prospective cohort study by Enstrom and Kabat was unable to demonstrate significant associations between exposure to spousal ETS and coronary heart disease, chronic obstructive pulmonary disease, or lung cancer. (It should be noted, however, that the report by Enstrom and Kabat has been widely criticized for methodological problems, and these authors also had financial ties to the tobacco industry.)

Correlation analysis provides a useful tool for thinking about this controversy. Consider data from the British Doctors Cohort. They reported the annual mortality for a variety of diseases at four levels of cigarette smoking per day: never smoked, 1-14/day, 15-24/day, and 25+/day. In order to perform a correlation analysis, I rounded the exposure levels to 0, 10, 20, and 30 respectively.

| Cigarettes Smoked Per Day | CVD Mortality Per 100,000 Men Per Year | Lung Cancer Mortality Per 100,000 Men Per Year |
|---|---|---|
| 0 | 572 | 14 |
| 10 (actually 1-14) | 802 | 105 |
| 20 (actually 15-24) | 892 | 208 |
| 30 (actually >24) | 1025 | 355 |

The figures below show the two estimated regression lines superimposed on the scatter diagram. The correlation with amount of smoking was strong for both CVD mortality (r = 0.98) and for lung cancer (r = 0.99). Note also that the Y-intercept is a meaningful number here; it represents the predicted annual death rate from these diseases in individuals who never smoked. The Y-intercept for prediction of CVD is slightly higher than the observed rate in never smokers, while the Y-intercept for lung cancer is lower than the observed rate in never smokers.

The linearity of these relationships suggests that there is an incremental risk with each additional cigarette smoked per day, and the additional risk is estimated by the slopes. This perhaps helps us think about the consequences of ETS exposure. For example, the risk of lung cancer in never smokers is quite low, but there is a finite risk; various reports suggest a risk of 10-15 lung cancers/100,000 per year. If an individual who never smoked actively was exposed to the equivalent of one cigarette's smoke in the form of ETS, then the regression suggests that their risk would increase by 11.26 lung cancer deaths per 100,000 per year. However, the risk is clearly dose-related. Therefore, if a non-smoker was employed by a tavern with heavy levels of ETS, the risk might be substantially greater.
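
The correlations and slopes quoted for the smoking data can be reproduced from the four rows of the table. This is a minimal sketch using the rounded exposure levels (0, 10, 20, 30); the helper name `fit_with_r` is an assumption of the sketch.

```python
from math import sqrt

def fit_with_r(xs, ys):
    """Return (b0, b1, r): least squares intercept, slope, and Pearson r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    r = sxy / sqrt(sxx * syy)
    return b0, b1, r

# Rounded exposure levels and annual mortality per 100,000 men, from the table.
cigs_per_day = [0, 10, 20, 30]
cvd_mortality = [572, 802, 892, 1025]
lung_mortality = [14, 105, 208, 355]

cvd_b0, cvd_b1, cvd_r = fit_with_r(cigs_per_day, cvd_mortality)
lung_b0, lung_b1, lung_r = fit_with_r(cigs_per_day, lung_mortality)

print(f"CVD:  intercept = {cvd_b0:.1f}, slope = {cvd_b1:.2f}, r = {cvd_r:.2f}")
print(f"Lung: intercept = {lung_b0:.1f}, slope = {lung_b1:.2f}, r = {lung_r:.2f}")
```

The lung cancer slope corresponds to the 11.26 additional deaths per 100,000 per year for each additional cigarette per day, and each intercept is the predicted rate for never smokers. With only four data points these estimates should be read as illustrative.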

Finally, it should be noted that some findings suggest that the association between smoking and heart disease is non-linear at the very lowest exposure levels, meaning that non-smokers have a disproportionate increase in risk when exposed to ETS due to an increase in platelet aggregation.

## Summary

Correlation and linear regression analysis are statistical techniques to quantify associations between an independent, sometimes called a predictor, variable (X) and a continuous dependent outcome variable (Y). For correlation analysis, the independent variable (X) can be continuous (e.g., gestational age) or ordinal (e.g., increasing categories of cigarettes per day). Regression analysis can also accommodate dichotomous independent variables.


The procedures described here assume that the association between the independent and dependent variables is **linear**. With some adjustments, regression analysis can also be used to estimate associations that follow another functional form (e.g., curvilinear, quadratic). Here we consider associations between one independent variable and one continuous dependent variable. The regression analysis is called simple linear regression; simple in this case refers to the fact that there is a **single independent variable**. In the next module, we consider regression analysis with several independent variables, or predictors, considered simultaneously.