11 Jul Missing data handling
- When a parameter has not been measured for all patients in the study, the unmeasured values are called missing data
- There are few studies without missing data
- If missing data are present, they should be described and a strategy chosen to address them
Missing data are a common problem in the life sciences, and they arise for many reasons (recall bias, loss to follow-up, data collected retrospectively from medical records, etc.).
This is a challenge. Missing data reduce the power of the study, because statistical models only use patients for whom every parameter of interest is filled in, which shrinks the effective sample size. They can also cause significant bias.
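To illustrate the loss of sample size, here is a minimal sketch with made-up records (the `patients` list, the column names and the `complete_cases` helper are all hypothetical). Each column is missing for only one patient in five, yet a model that needs all three columns can use only two of the five patients.

```python
# Hypothetical records: None marks a value that was not measured.
patients = [
    {"age": 54, "weight": 70.0, "smoker": True},
    {"age": 61, "weight": None, "smoker": False},
    {"age": None, "weight": 82.5, "smoker": True},
    {"age": 47, "weight": 66.0, "smoker": None},
    {"age": 39, "weight": 90.0, "smoker": False},
]

def complete_cases(rows, columns):
    """Keep only the rows where every listed column has a value."""
    return [r for r in rows if all(r[c] is not None for c in columns)]

columns = ["age", "weight", "smoker"]
kept = complete_cases(patients, columns)

# Share of values missing in each column, and share of patients lost overall.
missing_per_column = {
    c: sum(r[c] is None for r in patients) / len(patients) for c in columns
}
lost_fraction = 1 - len(kept) / len(patients)

print(missing_per_column)
print(f"{len(kept)} of {len(patients)} patients usable, {lost_fraction:.0%} lost")
```

The more parameters a model includes, the more this effect compounds, since one gap anywhere in a patient's row is enough to drop that patient.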
Some of these missing data may be due to chance (for example, if a physician correctly fills out 3 out of 4 case report forms). In this case, the sample remains representative of the study population.
Problems arise when missing data are not random (for example, if a questionnaire is submitted to depressed patients to measure their depression level and only the least depressed are able to complete it correctly).
To differentiate between random and non-random missing data, it is necessary to describe the data to determine whether patients with missing data have the same characteristics as those without missing data.
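A descriptive check of this kind might look like the sketch below. The data, the column names and the `split_by_missingness` helper are invented for illustration. The idea is simply to compare a fully observed characteristic (here, age) between patients whose score is missing and patients whose score is present.

```python
from statistics import mean

# Hypothetical data: the depression score is missing for some patients,
# while age is known for everyone.
patients = [
    {"age": 34, "score": 12},
    {"age": 41, "score": 15},
    {"age": 38, "score": 9},
    {"age": 67, "score": None},
    {"age": 72, "score": None},
    {"age": 45, "score": 20},
]

def split_by_missingness(rows, column):
    """Return (rows with the column observed, rows with the column missing)."""
    observed = [r for r in rows if r[column] is not None]
    missing = [r for r in rows if r[column] is None]
    return observed, missing

observed, missing = split_by_missingness(patients, "score")
mean_age_observed = mean(r["age"] for r in observed)
mean_age_missing = mean(r["age"] for r in missing)

print(f"mean age, score observed: {mean_age_observed:.1f}")
print(f"mean age, score missing:  {mean_age_missing:.1f}")
```

A large gap between the two groups, as in this toy example, suggests that the missing data are not random. Note that such a comparison can only reveal differences in characteristics that were actually recorded.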
Several methods exist to deal with missing data, but there is no consensus on which one to use.
To compensate for the loss of power due to missing data, statisticians use imputation techniques. It would be a waste to discard everything collected for a patient because a single parameter is missing. Imputation is the process of assigning a value to each missing data point.
A frequently used imputation technique is median imputation: the median value of the parameter is assigned to every patient for whom this parameter is missing.
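As a minimal sketch of median imputation, using only the Python standard library (the `impute_median` function and the `weights` values are hypothetical):

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]

weights = [70.0, None, 82.5, 66.0, None, 90.0]  # hypothetical weights in kg
imputed = impute_median(weights)
print(imputed)  # [70.0, 76.25, 82.5, 66.0, 76.25, 90.0]
```

Every patient with a missing value receives the same number, so the imputed column is less variable than the real one would have been.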
Other imputation techniques exist, including multiple imputation by chained equations (MICE). Each missing value is predicted from the patient's other parameters using regression models, and the operation is repeated several times to produce several completed datasets whose results are then combined. This technique imputes data more reliably than the median, especially when the missing data are not purely due to chance but depend on other recorded parameters.
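The sketch below illustrates the principle in a deliberately simplified form and is not a full chained-equations implementation. It assumes a single incomplete variable (so the "chain" has only one equation), uses a plain linear regression, and ignores the uncertainty of the regression coefficients, which a proper implementation would also take into account. All names and data are hypothetical.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; also returns the residual SD."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(e ** 2 for e in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, sd

def impute_once(ages, pressures, rng):
    """Fill missing pressures with a regression prediction plus random noise."""
    pairs = [(x, y) for x, y in zip(ages, pressures) if y is not None]
    a, b, sd = fit_line([x for x, _ in pairs], [y for _, y in pairs])
    return [
        y if y is not None else a + b * x + rng.gauss(0, sd)
        for x, y in zip(ages, pressures)
    ]

# Hypothetical data: age is complete, systolic pressure has gaps.
ages = [35, 42, 50, 58, 63, 70, 76]
pressures = [118, 121, None, 135, None, 142, 150]

rng = random.Random(0)  # fixed seed so the run is reproducible
m = 5                   # number of imputed datasets
completed = [impute_once(ages, pressures, rng) for _ in range(m)]

# Analyse each completed dataset, then pool the estimates (here: the mean).
estimates = [mean(dataset) for dataset in completed]
pooled_mean = mean(estimates)
print(f"pooled mean pressure over {m} imputations: {pooled_mean:.1f}")
```

Because the imputed values differ from one completed dataset to the next, the spread of the `estimates` list reflects how uncertain the imputation is, which a single median fill cannot show.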
How pvalue.io handles missing data
pvalue.io first filters out parameters (i.e. the columns of your file) with more than 20% missing data: these variables cannot be entered into the statistical models.
- When a parameter has missing data, pvalue.io shows the user the number and proportion of patients with at least one missing value. This is the proportion that would be excluded from the analysis if these data were not specifically handled.
- When the proportion of patients with at least one missing value is less than 5%, these patients are excluded from the model.
- When a parameter has less than 5% missing data, pvalue.io imputes the median for quantitative variables and the mode for qualitative variables.
- When a parameter has more than 5% missing data, imputation by chained equations is performed.
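The per-parameter thresholds listed above could be encoded roughly as follows. This is not pvalue.io's actual code: the function name, its arguments and the returned labels are hypothetical, the handling of a parameter with exactly 5% missing data is an assumption (the text does not specify it), and the patient-level exclusion rule is not modelled.

```python
def imputation_strategy(missing_fraction, kind):
    """Map a parameter's share of missing values to the handling described above.

    kind is "quantitative" or "qualitative"; missing_fraction is between 0 and 1.
    """
    if not 0 <= missing_fraction <= 1:
        raise ValueError("missing_fraction must be between 0 and 1")
    if kind not in ("quantitative", "qualitative"):
        raise ValueError("kind must be 'quantitative' or 'qualitative'")
    if missing_fraction == 0:
        return "nothing to impute"
    if missing_fraction > 0.20:
        return "excluded from models"
    if missing_fraction < 0.05:
        return "median" if kind == "quantitative" else "mode"
    # Assumption: exactly 5% is treated like "more than 5%".
    return "chained equations"

for fraction, kind in [(0.03, "quantitative"), (0.03, "qualitative"),
                       (0.10, "quantitative"), (0.25, "qualitative")]:
    print(f"{fraction:.0%} missing, {kind}: {imputation_strategy(fraction, kind)}")
```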