03 Jul Transformation of numerical variables
- In statistical modeling, it is often necessary to group the values of a numerical variable into classes in order to meet the conditions of the model.
- If we have no a priori idea about the appropriate grouping, it is preferable to rely on splines, which represent the link between the outcome variable (Y) and the explanatory variable (X).
For example, if we want to explain the probability of being born male as a function of diet, Y is male sex and X is diet.
A number of assumptions must be verified in order to be able to use a statistical model (allowing multivariate analysis).
In particular, there must be a linear relationship between the variable Y (if it is a linear regression, or a transformation of Y if it is another type of model) and all quantitative variables X.
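To make this assumption concrete, here is a sketch in notation that is assumed for illustration (the coefficients β0 and β1 are not named in the text). Linear regression models Y directly, whereas logistic regression, taken here as one example of "another type of model", models the logit of the probability of the event:

```latex
\begin{align*}
\text{linear regression:}\quad & \mathbb{E}[Y \mid X] = \beta_0 + \beta_1 X \\
\text{logistic regression:}\quad & \log\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} = \beta_0 + \beta_1 X
\end{align*}
```

In both cases the right-hand side is a straight line in X, which is the linearity being checked.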
When this assumption is not verified, the solution is to transform the numerical variable X (e.g. weight: 67kg, 78kg, etc.) into a qualitative class variable (e.g. weight: 60-70kg, 70-80kg, etc.).
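As a minimal sketch of this transformation, the snippet below maps weights to 10kg-wide classes. The function name `weight_class` and the boundary convention (a value equal to a bound goes to the upper class) are assumptions made for illustration, not something pvalue.io specifies.

```python
def weight_class(weight_kg, width=10):
    """Return the label of the fixed-width class containing weight_kg.

    Assumption: classes are closed on the left and open on the right,
    so 70 kg falls in "70-80kg", not in "60-70kg".
    """
    lower = int(weight_kg // width) * width
    return f"{lower}-{lower + width}kg"


weights = [67, 78, 70, 61.5]
classes = [weight_class(w) for w in weights]
print(classes)
```

The numerical values are replaced by class labels, which the model then treats as a qualitative variable.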
How to transform?
pvalue.io offers 3 different ways to transform a variable:
- Using data from the literature
- Using splines
- Into quantiles
Using data from the literature
When thresholds are commonly accepted or used in medical articles (e.g. for BMI), you can use those thresholds directly.
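For BMI, one possible sketch uses the WHO adult cut-offs (18.5, 25 and 30 kg/m²) as the literature thresholds. The choice of these cut-offs, the labels and the helper `bmi_class` are assumptions for this example; other articles may use different thresholds.

```python
from bisect import bisect_right

# WHO adult BMI cut-offs in kg/m^2 (assumed here as the "literature" thresholds).
BMI_THRESHOLDS = [18.5, 25.0, 30.0]
BMI_LABELS = ["underweight", "normal", "overweight", "obese"]


def bmi_class(bmi):
    """Map a BMI value to its class; a value equal to a threshold goes to the upper class."""
    return BMI_LABELS[bisect_right(BMI_THRESHOLDS, bmi)]


bmis = [17.9, 18.5, 24.9, 27.3, 30.0]
bmi_classes = [bmi_class(b) for b in bmis]
print(bmi_classes)
```

Keeping the thresholds in a single sorted list makes it easy to swap in another published set of cut-offs.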
Using splines
To simplify, a spline is a smooth curve representing the relationship between Y and X, drawn together with its confidence interval. In general, the confidence interval of the spline is wider at the extreme values of X: clinical parameters often follow an approximately normal distribution, so there are few individuals with values close to the extremes.
If the spline is a straight line, then the relationship is linear between Y and X and therefore this assumption is verified.
If the spline is a curve, there are two scenarios:
- It is possible to draw a line within the confidence interval, or one that slightly exceeds it; in this case, we can consider that the assumption is verified.
- It is clearly not possible to draw a line; the curve is divided into several increasing and decreasing parts. In this case, it will be necessary to transform the variable by choosing thresholds delimiting the classes.
Caution, however: if the confidence interval is wide, it is difficult to conclude, because many different lines could be drawn within it.
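The question "can a straight line be drawn within the confidence interval?" can also be checked numerically. The sketch below assumes the spline's lower and upper confidence bounds have been exported on a grid of X values; the function `line_fits_in_band` and the toy bands are hypothetical. It is stricter than the visual rule above, since it does not allow the line to slightly exceed the interval.

```python
def line_fits_in_band(x, lower, upper, tol=1e-9):
    """Return True if some straight line a*x + b stays within [lower, upper] at every x."""
    span = max(x) - min(x)
    if span == 0:
        raise ValueError("need at least two distinct x values")
    # Any admissible line has a slope bounded by the band's total height over the x range.
    max_slope = (max(upper) - min(lower)) / span

    def margin(a):
        # For a fixed slope a, an intercept b works if and only if
        # max(lower_i - a*x_i) <= b <= min(upper_i - a*x_i).
        highest_floor = max(l - a * xi for xi, l in zip(x, lower))
        lowest_ceiling = min(u - a * xi for xi, u in zip(x, upper))
        return lowest_ceiling - highest_floor

    # margin(a) is concave in a, so a ternary search finds its maximum.
    lo, hi = -max_slope, max_slope
    for _ in range(200):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if margin(m1) < margin(m2):
            lo = m1
        else:
            hi = m2
    return margin((lo + hi) / 2) >= -tol


# Hypothetical confidence bands (fitted value +/- 0.5) on a small grid.
grid = [0, 1, 2, 3, 4]
straight = line_fits_in_band(grid, [v - 0.5 for v in grid], [v + 0.5 for v in grid])
peak = [0, 2, 4, 2, 0]  # increasing, then decreasing
bent = line_fits_in_band(grid, [v - 0.5 for v in peak], [v + 0.5 for v in peak])
print(straight, bent)
```

A very wide band makes the function return True almost regardless of the curve's shape, which is the numerical counterpart of the caution above.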
Where to cut?
If the curve can be divided into several parts (e.g. an increasing part, then a decreasing part and then a horizontal part), the optimal position to cut this curve, and therefore to create a class, is the junction between two parts of the curve.
In the figure below, the optimal position is around 45; the curve is increasing before 45 and horizontal after 45; the confidence interval is too wide before 30 to be able to say that the curve is decreasing.
On pvalue.io, you just have to click on the curve to automatically create the corresponding classes. You can then adjust them (for example, by rounding to the nearest integer or the nearest ten).
Most of the time, creating two or three classes (so one or two clicks) is enough.
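Outside pvalue.io, the idea of cutting at the junction between two parts of the curve can be sketched as follows. The function `find_junctions`, its `flat_tol` parameter and the toy curve are all hypothetical; the sketch looks only at the fitted curve, so the width of the confidence interval still has to be judged separately.

```python
def find_junctions(x, y, flat_tol):
    """Return the x values where the curve switches between rising, falling and flat.

    flat_tol is a hypothetical tuning parameter: a segment whose absolute slope
    is at or below it is treated as horizontal.
    """
    def trend(slope):
        if slope > flat_tol:
            return "increasing"
        if slope < -flat_tol:
            return "decreasing"
        return "flat"

    trends = [
        trend((y[i + 1] - y[i]) / (x[i + 1] - x[i]))
        for i in range(len(x) - 1)
    ]
    # A junction sits at the grid point shared by two segments with different trends.
    return [x[i + 1] for i in range(len(trends) - 1) if trends[i] != trends[i + 1]]


# Hypothetical spline values on a grid of X; not taken from the article's figure.
x = [30, 35, 40, 45, 50, 55, 60]
y = [0.10, 0.20, 0.30, 0.40, 0.40, 0.41, 0.40]
cuts = find_junctions(x, y, flat_tol=0.005)
print(cuts)
```

With this toy curve, which rises up to 45 and is roughly horizontal afterwards, a single threshold at 45 is returned, giving two classes.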
Into quantiles
More commonly, patients can be grouped into quantiles (terciles, quartiles, quintiles).
If patients are ordered according to the value of the studied parameter (the patient with the lowest value first, the one with the highest value last), terciles, quartiles and quintiles split the ordered patients into 3, 4 and 5 groups of equal size, respectively.
For instance, if I have 66 patients and I want to create age terciles, I will have a group of 22 patients between the youngest and 22nd youngest, a group of 22 patients between the 23rd and 44th, and 22 patients between the 45th and 66th.
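The tercile example can be reproduced with a short rank-based sketch. The function `quantile_groups` and the generated ages are hypothetical; the handling of ties and of sample sizes that are not a multiple of the number of groups is an assumption, and statistical packages may use other conventions.

```python
def quantile_groups(values, n_groups):
    """Assign each value a group number from 1 to n_groups according to its rank.

    Assumption: ties are broken by original position, and when the sample size
    is not a multiple of n_groups the group sizes differ by at most one.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    groups = [0] * n
    for rank, i in enumerate(order):
        groups[i] = rank * n_groups // n + 1
    return groups


# Hypothetical ages for 66 patients (all distinct, in arbitrary order).
ages = [20 + (7 * k) % 66 for k in range(66)]
tercile = quantile_groups(ages, 3)
sizes = [tercile.count(g) for g in (1, 2, 3)]
print(sizes)
```

Each patient keeps their original position in the list, and only the group number reflects the ordering by age.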