03 Jul Univariate and multivariate analyses
- We can consider the following three types of analyses: descriptive analyses, univariate analyses and multivariate (or multivariable) analyses
- Descriptive analyses are used to describe the data, and are useful for detecting problems
- Univariate and multivariate analyses allow statistical comparisons (obtaining a p-value), and only multivariate analyses allow confounding factors to be taken into account
Before starting a statistical analysis, it is necessary to have a good knowledge of your data. What is the proportion of women? How old is the oldest patient?
Descriptive analyses answer these questions. They also make it possible to:
- identify outliers, i.e. patients with extreme values;
- check the distribution of the data: are they normally distributed?
Imagine that in the age column, a patient is 182 years old; it is likely (unless you are doing a study on the Jedi) that there was a mistake somewhere.
It will therefore be necessary to find the real age of this patient or to record the value as missing. If this error is not detected and corrected, any statistical analysis that takes age into account will be distorted.
Carrying out a descriptive analysis is therefore a prerequisite for any statistical analysis, whether univariate or multivariate.
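As a minimal sketch of this kind of check, the Python snippet below flags implausible ages and replaces them with a missing value. The bounds `MIN_AGE` and `MAX_AGE` and the function name `clean_ages` are assumptions chosen for illustration; they are not part of pvalue.io.

```python
from typing import Optional

# Assumed plausibility bounds for a human age in years; adapt them to your study.
MIN_AGE = 0
MAX_AGE = 120

def clean_ages(ages: list[Optional[float]]) -> tuple[list[Optional[float]], list[int]]:
    """Return the ages with implausible values set to None, plus the flagged row indexes."""
    cleaned: list[Optional[float]] = []
    flagged: list[int] = []
    for i, age in enumerate(ages):
        if age is not None and not (MIN_AGE <= age <= MAX_AGE):
            flagged.append(i)     # remember which row to check against the source file
            cleaned.append(None)  # treat the value as missing until it is corrected
        else:
            cleaned.append(age)
    return cleaned, flagged

ages = [34, 57, 182, 45, None, 61]
cleaned, flagged = clean_ages(ages)
print(flagged)  # [2]
print(cleaned)  # [34, 57, None, 45, None, 61]
```

Keeping the list of flagged rows makes it easy to go back to the patient file and look for the real value before settling for a missing one.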
Graphs are an integral part of descriptive analyses because they allow you to quickly visualize the structure of your data.
Once you have selected the variables you want to describe, pvalue.io automatically creates a table and graph.
If the variable is quantitative, the table contains the mean, standard deviation, median, 25th and 75th percentiles, minimum and maximum; the graph shows the distribution of the variable as a histogram.
If the variable is qualitative, the table provides the number of subjects in each class; the graph represents the distribution in each class in the form of a bar graph.
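To make the content of these tables concrete, here is a small sketch using only the Python standard library. It is not the code used by pvalue.io: the function names are hypothetical, and the quartile method (`method="inclusive"`) is an assumption, since different software packages compute percentiles slightly differently.

```python
import statistics
from collections import Counter

def describe_quantitative(values):
    """Summary statistics for a numerical variable (missing values already removed)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "median": median,
        "q1": q1,                        # 25th percentile
        "q3": q3,                        # 75th percentile
        "min": min(values),
        "max": max(values),
    }

def describe_qualitative(values):
    """Number of subjects in each class of a qualitative variable."""
    return dict(Counter(values))

# Made-up values, for illustration only.
ages = [30, 40, 50, 60, 70]
sexes = ["F", "M", "F", "F", "M"]

summary = describe_quantitative(ages)
counts = describe_qualitative(sexes)
print(summary)
print(counts)  # {'F': 3, 'M': 2}
```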
Univariate analyses examine the relationship between two variables: does blood pressure (variable 1) differ according to sex (variable 2)? Does the proportion of smokers differ according to eye colour? etc.
The purpose of univariate analyses is to answer the question: is the difference observed between my patients a real difference, or could it be due to chance? Univariate analyses are based on statistical tests, which provide a p-value (the probability of observing a difference at least as large as the one observed if there were in fact no real difference). The test used depends on the type of the variables:
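One way to build intuition for what a p-value measures is a permutation test, sketched below. This is a swapped-in technique chosen for illustration, not one of the tests run by pvalue.io: the group labels are shuffled many times, and we count how often chance alone produces a difference in means at least as large as the one observed. The function name, the number of permutations and the blood pressure values are all assumptions.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign patients to groups at random
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# Made-up systolic blood pressure values (mmHg), for illustration only.
women = [118, 121, 125, 130, 117, 122]
men = [128, 135, 131, 140, 126, 133]

p = permutation_p_value(women, men)
print(f"p = {p:.4f}")
```

A small value means that shuffled data rarely reproduce a difference this large, which is the sense in which the observed difference is unlikely to be explained by chance alone.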
If one variable is numerical and the other qualitative
- To compare 2 groups of patients (for example: gender)
  - More than 30 patients in each group: Welch t-test
  - Fewer than 30 patients in at least one group: Mann-Whitney test
- To compare more than two groups (for example: eye colour)
  - More than 30 patients per group: ANOVA (analysis of variance)
  - Fewer than 30 patients in at least one group: Kruskal-Wallis test
If both variables are quantitative
pvalue.io will automatically perform the Pearson correlation test
If both variables are qualitative
- Chi2 test if the expected number of subjects in all cells of the cross-table is greater than 5
- Fisher test if not
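The decision rules listed above can be sketched as a small Python function. This only names the test; it does not run it, and it is not pvalue.io's actual code. The function names and the `"numerical"` / `"qualitative"` labels are hypothetical. The rules do not say what happens with exactly 30 patients in a group, so the sketch assumes such a group counts as small.

```python
def expected_counts(table):
    """Expected cell counts under independence for a cross-table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def choose_test(kind_1, kind_2, group_sizes=None, table=None):
    """Name the test suggested by the rules above.

    kind_1, kind_2: "numerical" or "qualitative"
    group_sizes:    patients per group (numerical vs qualitative case)
    table:          observed cross-table (qualitative vs qualitative case)
    """
    kinds = {kind_1, kind_2}
    if kinds == {"numerical"}:
        return "Pearson correlation test"
    if kinds == {"qualitative"}:
        all_large = all(e > 5 for row in expected_counts(table) for e in row)
        return "Chi2 test" if all_large else "Fisher test"
    # One numerical and one qualitative variable.
    # Assumption: a group of exactly 30 patients is treated as small.
    large_groups = all(n > 30 for n in group_sizes)
    if len(group_sizes) == 2:
        return "Welch t-test" if large_groups else "Mann-Whitney test"
    return "ANOVA" if large_groups else "Kruskal-Wallis test"

print(choose_test("numerical", "qualitative", group_sizes=[45, 12]))  # Mann-Whitney test
print(choose_test("qualitative", "qualitative", table=[[3, 1], [1, 3]]))  # Fisher test
```

Note that the Chi2 rule is applied to the expected counts, not the observed ones: in the second call every expected count equals 2, so the Fisher test is chosen.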
pvalue.io will automatically perform these tests, report the results in a table and generate:
- A boxplot if you cross a numerical variable with a qualitative variable
- A bar plot if you cross two qualitative variables
- Survival curves if you perform survival analyses
- A scatterplot if you cross two numerical variables, as well as the calculated linear relationship.
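For two numerical variables, the quantities behind the scatterplot can be computed by hand. The sketch below assumes that the "calculated linear relationship" is the ordinary least-squares line; the function names are hypothetical and the data are made up.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two numerical variables."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(x, y):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up, perfectly linear data: y = 2x + 1.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

r = pearson_r(x, y)
slope, intercept = least_squares_line(x, y)
print(r, slope, intercept)
```

The correlation coefficient tells you how closely the points follow a straight line, while the slope tells you how much the second variable changes when the first increases by one unit.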
Be careful: univariate analyses do not take confounding factors into account. Consider a sample in which women are younger than men. We want to know whether the effect of the treatment on survival differs by sex. If we find p < 0.05, is it because of sex or because of age?
Only a randomized trial is designed to make patient characteristics comparable between groups. In this study design, and only this one, univariate analyses alone are sufficient. Outside of a randomized trial, it is necessary to adjust for confounding factors. This is the purpose of multivariate analyses.
Multivariate analyses take confounding factors into account by adjusting for them. They are therefore recommended when attempting to identify a statistical link between several variables. They use more sophisticated statistical methods than univariate analyses, and are rarely available in software for non-statisticians.
In the previous example, adjusting for age allows us to conclude: if the men and women in my sample were the same age, then the effect of treatment would (or would not) be statistically significant.
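The idea of adjustment can be illustrated with stratification, which is a simpler technique than the regression models used for multivariate analyses, but rests on the same logic: compare men and women of the same age. The counts below are made up so that women are younger than men, as in the example; the names `data`, `crude_risk` and `risk_by_age` are hypothetical.

```python
# Made-up counts, for illustration only: (age band, sex) -> (deaths, patients)
data = {
    ("young", "F"): (8, 80),
    ("young", "M"): (2, 20),
    ("old", "F"): (8, 20),
    ("old", "M"): (32, 80),
}

def crude_risk(sex):
    """Risk of death for one sex, ignoring age (univariate view)."""
    deaths = sum(d for (_, s), (d, _) in data.items() if s == sex)
    patients = sum(n for (_, s), (_, n) in data.items() if s == sex)
    return deaths / patients

def risk_by_age(sex):
    """Risk of death for one sex within each age band (age-adjusted view)."""
    return {band: d / n for (band, s), (d, n) in data.items() if s == sex}

print(crude_risk("F"), crude_risk("M"))  # 0.16 0.34
print(risk_by_age("F"))                  # {'young': 0.1, 'old': 0.4}
print(risk_by_age("M"))                  # {'young': 0.1, 'old': 0.4}
```

Ignoring age, men appear to die twice as often as women. Within each age band the risk is identical, so in this toy example the whole crude difference is explained by age.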
Statistical models make it possible to obtain p-values. They have a further, substantial benefit: they measure the extent to which a factor affects the outcome variable. These association measures are:
- Odds ratios for logistic regressions
- Hazard ratios for Cox models
- Estimates or coefficients for linear regressions
The p-value gives information on statistical significance, whereas the association measures quantify the strength of the relationship between two variables.
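As an example of an association measure, the sketch below computes an unadjusted odds ratio from a 2x2 table, with an approximate 95% confidence interval based on the log odds ratio. A logistic regression with a single binary factor gives the same odds ratio; adjusted odds ratios require a model with several factors, which this sketch does not fit. The function names and the counts are assumptions made for illustration.

```python
import math

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

                 outcome present   outcome absent
    exposed            a                 b
    unexposed          c                 d
    """
    return (a * d) / (b * c)

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Approximate 95% confidence interval, computed on the log scale."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of the log odds ratio
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

# Made-up counts: 20 of 100 exposed and 10 of 100 unexposed patients have the outcome.
estimate = odds_ratio(20, 80, 10, 90)
low, high = odds_ratio_ci(20, 80, 10, 90)
print(estimate)  # 2.25
print(low, high)
```

An odds ratio of 1 means no association. Here the estimate is 2.25, but the interval has to be read alongside it: the p-value says whether the association is statistically significant, the odds ratio says how strong it is.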
Statistical models require that a number of conditions are met.