Linear regressions

  • When the outcome variable is numerical and continuous, the appropriate statistical model is linear regression
  • When there is a single categorical explanatory variable with 2 classes, linear regression yields a result close to a Student or Welch t-test
To simplify, we will call Y the variable that we want to explain, and X the explanatory factors. (Dig into your distant memories: Y = aX + b)
For example, if we want to study the height of a child according to the height of her mother, Y is the child’s height and X is the mother’s height.
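To make the Y = aX + b idea concrete, here is a minimal sketch of a simple linear regression in Python with NumPy. The heights are made-up illustrative numbers (pvalue.io itself requires no code; this only shows what fitting a and b means):

```python
import numpy as np

# Hypothetical data: mother's height (cm) and child's adult height (cm)
mother = np.array([155.0, 160.0, 162.0, 165.0, 168.0, 170.0, 172.0, 175.0])
child = np.array([158.0, 163.0, 160.0, 167.0, 170.0, 169.0, 174.0, 176.0])

# Fit Y = a*X + b by ordinary least squares
X = np.column_stack([mother, np.ones_like(mother)])  # design matrix [X, 1]
(a, b), *_ = np.linalg.lstsq(X, child, rcond=None)

print(f"slope a = {a:.2f}, intercept b = {b:.1f}")
```

The fitted slope a is the regression coefficient: the expected change in the child’s height for each additional centimetre of the mother’s height.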

What’s the purpose of this?

Traditional statistical tests (Student’s t-test and the Chi-squared test being the most commonly used in medicine) determine whether the differences observed between 2 or more groups can be explained by chance through sampling fluctuation (we then say that the null hypothesis of no difference cannot be rejected) or whether such a difference is unlikely to be due to chance (rejection of the null hypothesis).
These univariable tests raise a key issue: they do not take potential confounding factors into account. Yet confounders are common in medicine. It is therefore necessary to use more complex statistical methods, known as regression models.
Thus, it is possible to test each of the factors X that may have an influence on the variable Y, and to give them a weight (or a coefficient).

Assumptions of linear regression

There are always assumptions to check for statistical models. If you would like to know more, we suggest you read the following post.

Confounding factors
Let’s imagine that we want to find out whether coffee drinkers are more likely to develop lung cancer than non-coffee drinkers. If we run a simple statistical test, we will find a significant association between coffee intake and lung cancer. However, not adjusting would be a mistake here, as it is necessary to take into account (among other things) smoking as a confounding variable.
The significant association found by the test is due both to the statistical association between smoking and cancer and to the fact that coffee consumption is more frequent among smokers, a classic confounding bias.
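The coffee/smoking scenario can be simulated to show what adjustment does. In this hedged sketch the data are entirely artificial: a continuous "lung-damage score" depends only on smoking, while smokers also drink more coffee. An unadjusted regression wrongly attributes an effect to coffee; adding smoking to the model makes the coffee coefficient vanish:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical simulation: smoking drives both coffee intake and lung damage
smoker = rng.binomial(1, 0.3, n).astype(float)
coffee = 1.0 + 2.0 * smoker + rng.normal(0, 0.5, n)   # smokers drink more coffee
damage = 10.0 + 5.0 * smoker + rng.normal(0, 1.0, n)  # damage depends on smoking only

def ols(X, y):
    """Least-squares coefficients for y = X @ beta."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
naive = ols(np.column_stack([ones, coffee]), damage)             # coffee alone
adjusted = ols(np.column_stack([ones, coffee, smoker]), damage)  # coffee + smoking

print(f"coffee coefficient, unadjusted: {naive[1]:.2f}")
print(f"coffee coefficient, adjusted for smoking: {adjusted[1]:.2f}")
```

The unadjusted coefficient is large and spurious; once the confounder enters the model, the weight given to coffee drops to about zero. This is exactly the point of multivariable regression.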

How to perform linear regressions with pvalue.io

Let the intuitive software interface guide you

  1. Choose to perform an explanatory analysis
  2. Select the outcome variable (Y) and the factors known to influence the outcome variable (X)
  3. Check that no errors are found according to the descriptive analysis (by looking at the graphs and tables)
  4. Transform variables that are not linearly related to the outcome variable
  5. That’s it

If the assumptions of linear regression are not met, pvalue.io will let you know whether action is required.

Interpretation of the results

The coefficients

Numerical variable

The coefficient represents the expected change in Y when the value of X increases by 1 unit.

Categorical variable

The coefficient represents the expected change in Y when the categorical variable takes the value of the given class, relative to the reference class.
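Both interpretations can be checked on simulated data. This sketch, loosely echoing the birth-weight example below, uses invented effect sizes (5 g per year of mother’s age, +140 g for boys) and dummy-codes sex with girls as the reference class:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical data: mother's age (years) and child's sex
age = rng.uniform(20, 40, n)
boy = rng.binomial(1, 0.5, n).astype(float)  # dummy coding: F is the reference class
weight = 3000.0 + 5.0 * age + 140.0 * boy + rng.normal(0, 300, n)

X = np.column_stack([np.ones(n), age, boy])
beta = np.linalg.lstsq(X, weight, rcond=None)[0]

# beta[1]: expected change in weight (g) per extra year of mother's age
# beta[2]: expected difference in weight (g) for boys relative to girls
print(f"age coefficient: {beta[1]:.1f} g per year")
print(f"sex coefficient: {beta[2]:.0f} g (M vs F)")
```

The fitted coefficients recover the simulated values: one per unit of the numerical variable, one per non-reference class of the categorical variable.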

The p-values

It is common to set the alpha risk at 5%: it is the risk, accepted a priori, of wrongly concluding that a coefficient at least as large as the one observed is not due to chance. In other words, it is the risk of wrongly concluding that the results obtained cannot be explained by chance alone.
The p-value is computed a posteriori and corresponds to the probability of observing, through chance alone, a coefficient at least as large as the one obtained.
Thus, when the p-value is lower than the alpha risk, the null hypothesis that the coefficient is zero is rejected.
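The idea of "a coefficient at least as large, through chance alone" can be made tangible with a permutation test. This is only an illustrative sketch on artificial data, not the classical t-test on coefficients that regression software reports: shuffling Y destroys any real association with X, so the permuted slopes show what chance alone produces.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60

# Hypothetical data with a genuine linear effect of X on Y
x = rng.normal(0, 1, n)
y = 2.0 * x + rng.normal(0, 1, n)

def slope(x, y):
    """OLS slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

observed = slope(x, y)

# Null hypothesis "slope = 0": compare the observed slope with slopes
# obtained after randomly shuffling y
perm = np.array([slope(x, rng.permutation(y)) for _ in range(10_000)])
p_value = np.mean(np.abs(perm) >= abs(observed))

print(f"observed slope: {observed:.2f}, p-value: {p_value:.4f}")
```

Here the observed slope is far outside what shuffled data produce, so the p-value is tiny and the null hypothesis of a zero coefficient is rejected.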

When a categorical (qualitative) variable has more than 2 classes, it is possible to compute a global p-value for the variable; this p-value corresponds to the test that the coefficients of all its non-reference classes are simultaneously zero.

In the table below, we wanted to know if the child’s birth weight was related to the mother’s age (Age mother), the child’s sex, the rank of pregnancy, and whether he or she has a malformation.

Variable            Class              Estimation [95% CI]    p        p global
Age mother                             4.45 [-0.152, 9.0]     0.058
Sex                 M vs F             138 [100, 180]         <0.001
Rank of pregnancy   twin vs single     -285 [-335, -234]      <0.001   <0.001
                    triple vs single   -442 [-589, -295]      <0.001
Malformation        yes vs no          -71.4 [-138, -4.87]    0.035

We conclude as follows:

  • The mother’s age does not significantly influence the child’s weight (p > 0.05): for each additional year, the estimated weight increases by 4.45 g, with a confidence interval that includes 0: [-0.152, 9.0]
  • Being a boy significantly increases the child’s weight (+138 g [100, 180])
  • A twin pregnancy significantly reduces the child’s weight (-285 g [-335, -234]) compared to a single pregnancy
  • A triple pregnancy significantly reduces the child’s weight (-442 g [-589, -295]) compared to a single pregnancy
  • Overall, a multiple pregnancy results in a lower weight in children (global p-value < 0.001)
  • Having a malformation significantly reduces the child’s weight (-71.4 g [-138, -4.87])