11 Mar Management of Covariates
This page describes the methodological choices made for both the selection of variables and their modeling. This page is technical. The R code corresponding to the descriptions below is available here.
Variable selection
Definitions
We distinguish between two types of covariates (variables included in a model that are not the main explanatory/predictor variable):
- Explanatory or predictor variables, depending on the purpose of the analysis, that have clinical significance, and that may be considered confounding variables in the literature
- Extraneous variables for which there is a statistical relationship with the response (explained) variable
In explanatory analyses, pvalue.io displays the results for the explanatory variables, because we may be interested in quantifying the effect of each explanatory variable on the explained variable. This is not the case for the extraneous variables: they are included in the statistical model only to improve its performance, without quantifying the statistical relationship between each extraneous variable and the explained variable. In predictive analyses, pvalue.io displays the results for all covariates, whatever their type.
Selection process
To select the extraneous variables, pvalue.io uses a LASSO model between the explained (or predicted) variable and the explanatory (or predictor) variables. It first determines the maximum number of covariates. The explanatory variables are forced into the model by leaving them unpenalized. pvalue.io then performs a 10-fold cross-validation of a LASSO model adapted to the data to be modeled (linear regression, logistic regression, or Cox model), which yields several candidate penalty parameters λ. By default, pvalue.io uses the largest value of λ for which the cross-validation error is within 1 standard error of the minimum cross-validation error. If, with this parameter, the number of non-zero coefficients is less than the maximum number of covariates, this parameter is kept. Otherwise, pvalue.io chooses the highest value of λ that yields a number of non-zero coefficients equal to the maximum number of covariates.
The feasibility of the model is then checked. To meet the validity conditions of the statistical models, the algorithm removes the covariates that show multicollinearity, defined as a VIF (Variance Inflation Factor) greater than 5.
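The choice of λ described above can be sketched as follows. This is a minimal illustration in Python, not the R code used by pvalue.io. The function name `select_lambda` and its arguments are hypothetical: it assumes the cross-validation has already produced, for each candidate λ, a mean error, a standard error, and a count of non-zero coefficients. The fallback used when no λ gives exactly the maximum number of covariates is an assumption, since the text does not cover that case.

```python
from typing import Sequence


def select_lambda(
    lambdas: Sequence[float],
    cv_mean: Sequence[float],
    cv_se: Sequence[float],
    n_nonzero: Sequence[int],
    max_covariates: int,
) -> float:
    """Pick a penalty following the rule described in the text.

    All sequences are aligned: element i describes the candidate lambdas[i].
    """
    idx = range(len(lambdas))

    # Candidate with the minimum cross-validation error.
    i_min = min(idx, key=lambda i: cv_mean[i])
    threshold = cv_mean[i_min] + cv_se[i_min]

    # Largest lambda whose error is within 1 standard error of the minimum.
    within_1se = [i for i in idx if cv_mean[i] <= threshold]
    i_1se = max(within_1se, key=lambda i: lambdas[i])

    # Keep it if the model is smaller than the allowed maximum.
    if n_nonzero[i_1se] < max_covariates:
        return lambdas[i_1se]

    # Otherwise, take the highest lambda with exactly the maximum count.
    exact = [lambdas[i] for i in idx if n_nonzero[i] == max_covariates]
    if exact:
        return max(exact)

    # Assumption (not in the text): if no lambda gives exactly the maximum,
    # use the highest lambda with the largest count that does not exceed it.
    allowed = [i for i in idx if n_nonzero[i] <= max_covariates]
    if not allowed:
        raise ValueError("no lambda satisfies the covariate limit")
    best = max(n_nonzero[i] for i in allowed)
    return max(lambdas[i] for i in allowed if n_nonzero[i] == best)


# Made-up cross-validation results, for illustration only.
lambdas = [1.0, 0.5, 0.25, 0.1, 0.05]
cv_mean = [2.0, 1.5, 1.1, 1.0, 1.05]
cv_se = [0.2, 0.2, 0.15, 0.15, 0.15]
n_nonzero = [0, 1, 3, 5, 7]

chosen = select_lambda(lambdas, cv_mean, cv_se, n_nonzero, max_covariates=4)
```

With these made-up values, the 1-standard-error rule keeps three non-zero coefficients, which is below the limit of four, so the cap does not come into play. Whether the count should include the unpenalized explanatory variables is left to the caller here.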
Use of covariates in statistical regression models
The user of pvalue.io checks the (log)linearity of all numerical covariates. If the condition is not met, a transformation is proposed to the user:
- If thresholds are determined in the literature or if the user wants a split into quantiles, then the numerical variable is transformed into a categorical variable.
- Otherwise
  - If the analysis is explanatory,
    - The explanatory variables are transformed into categorical variables at the positions indicated by the user, because in this case the coefficient must be interpretable.
    - The extraneous variables undergo a natural cubic spline transformation (at the positions indicated by the user). This transformation gives better model performance than a transformation into categorical variables, but it makes the coefficients of the statistical models uninterpretable. However, the purpose of an extraneous variable is not to quantify its effect but to improve the performance of the model, so interpreting this relationship is not necessary.
  - If the analysis is predictive, the covariates undergo a natural cubic spline transformation whatever their type. For this type of analysis, the performance of the model is more important than its interpretation.
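The decision rules above can be summarized as a small function. This is a sketch in Python for illustration only, not the pvalue.io implementation. The function name, argument names, and returned labels are all hypothetical.

```python
def choose_transformation(
    analysis: str,
    role: str,
    is_linear: bool,
    wants_cutpoints: bool,
) -> str:
    """Return the transformation proposed for one numerical covariate.

    analysis: "explanatory" or "predictive"
    role: "explanatory" (or predictor) or "extraneous"
    is_linear: True if the (log)linearity condition is met
    wants_cutpoints: True if literature thresholds exist or the user
        asks for a split into quantiles
    """
    if analysis not in ("explanatory", "predictive"):
        raise ValueError(f"unknown analysis type: {analysis!r}")
    if role not in ("explanatory", "extraneous"):
        raise ValueError(f"unknown covariate role: {role!r}")

    # No transformation is needed when the condition is met.
    if is_linear:
        return "none"

    # Thresholds or quantiles always lead to a categorical variable.
    if wants_cutpoints:
        return "categorical"

    # Predictive analyses favor performance over interpretation.
    if analysis == "predictive":
        return "natural_cubic_spline"

    # Explanatory analyses: only the explanatory variables need
    # interpretable coefficients.
    if role == "explanatory":
        return "categorical"
    return "natural_cubic_spline"


# Example: a non-linear extraneous covariate in an explanatory analysis.
result = choose_transformation(
    analysis="explanatory",
    role="extraneous",
    is_linear=False,
    wants_cutpoints=False,
)
```

In this example the covariate is extraneous, so the function proposes a natural cubic spline; the same covariate treated as an explanatory variable would be proposed as categorical.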