16 Jul How to prepare a file for statistical analysis?
Do you have a study to carry out? Great! We have prepared a list of 10 rules to follow so that your data can be analyzed by biostatistics software.
Which data entry software to choose?
First of all, you need to decide which file format you will use to enter the data. We suggest that you use Microsoft Excel or LibreOffice Calc to build the database, using only the first sheet. Other programs are excellent, free, and specialized for data entry (notably Epidata Entry), but they require more training than Excel or Calc. If you don’t know how to use Excel, visit our dedicated page.
Before the entry: how to structure the database?
1- Before collecting the data, think about which data are really worth analyzing
Too often, the number of variables collected is much higher than the number actually analyzed. This considerably increases the work of the person in charge of the collection (which is always tedious), and the risk of obtaining missing data that could reduce the power of the study. For a “small” study, it is not unreasonable to have fewer than 10 columns! Avoid open comments; define their categories a priori instead. If this is not possible, classify the comments once the entry is complete (see below).
It is also important to limit the number of classes for a non-numerical column: the ideal is to have fewer than 5 classes, and never more than 10.
2- Only one row per patient, only one characteristic per column
And that’s it: you don’t want to have any summary lines or tables inside your Excel sheet. There should also be no merged columns or rows.
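As a rough illustration of the "one row per patient" rule, the sketch below lists identifiers that appear on more than one row. It assumes the sheet has already been read into a list of dictionaries (for instance with `csv.DictReader`) and that a column named `Patient` holds the identifier; both the function name and the column name are hypothetical.

```python
from collections import Counter


def repeated_patients(rows, id_column="Patient"):
    """Return the identifiers that appear on more than one row."""
    counts = Counter(row[id_column] for row in rows)
    return [patient for patient, n in counts.items() if n > 1]


# Hypothetical data: patient "1" was entered twice.
rows = [
    {"Patient": "1", "Age": "54"},
    {"Patient": "2", "Age": "61"},
    {"Patient": "1", "Age": "55"},
]
print(repeated_patients(rows))
```

An empty list means every patient occupies exactly one row.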
3- The first row must be the name of the variable
We recommend that you use an explicit and short name. This will allow you to produce graphics without having to rewrite the axis titles. The biostatistics software pvalue.io allows names of up to 40 characters (longer names are truncated) and supports spaces.
The classes of categorical (non-numerical) variables should also be explicit labels, not numeric codes. By default, pvalue.io assumes that variables with fewer than 5 different values are categorical variables.
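To see why numeric codes are risky, the "fewer than 5 different values" heuristic can be sketched as follows. This is only an illustration of the rule as stated here, not the actual implementation used by pvalue.io; the function name and the choice to ignore empty cells are assumptions.

```python
def looks_categorical(values, threshold=5):
    """Apply the stated heuristic: fewer than `threshold` distinct values.

    Empty cells are ignored (an assumption of this sketch).
    """
    distinct = {str(v).strip() for v in values if str(v).strip() != ""}
    return len(distinct) < threshold


# A column coded 1/2/3 would be treated as categorical, but its classes
# would be unreadable in tables and graphics: explicit labels are safer.
print(looks_categorical(["1", "2", "3", "1", ""]))
print(looks_categorical(["Male", "Female", "Male"]))
```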
4- No column is the opposite of another
In the example below, it is not necessary to have both columns. A single column “Sex”, coded “Male” for men and “Female” for women is more appropriate.
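If a file already contains two opposite columns, they can be collapsed into one. The sketch below assumes two hypothetical columns, `Male` and `Female`, each coded 1 or 0; the actual column names and coding in your file may differ.

```python
def merge_sex_columns(row):
    """Replace the hypothetical 'Male' and 'Female' 0/1 columns by one 'Sex' column."""
    male = str(row.get("Male", "")).strip()
    female = str(row.get("Female", "")).strip()
    if male == "1" and female == "0":
        sex = "Male"
    elif male == "0" and female == "1":
        sex = "Female"
    else:
        sex = ""  # missing or contradictory values are left empty for review
    merged = {k: v for k, v in row.items() if k not in ("Male", "Female")}
    merged["Sex"] = sex
    return merged


print(merge_sex_columns({"Patient": "1", "Male": "0", "Female": "1"}))
```

Contradictory rows (for example 1 in both columns) are deliberately left empty rather than guessed.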
5- No column should be a transformation of another
In the example below, “Age in class” is a transformation of “Age”. Only one should be kept; in this example, “Age” (see rule 6).
6- Do not code numeric variables in classes
It is always preferable to keep a numerical column as long as the assumptions of the statistical methods are met. Converting it into classes causes a significant loss of information.
The exception to this rule is when the thresholds are defined in the literature or are in common use.
How to enter data?
7- No unit of measurement or percentage in a numerical column.
As a general rule, no non-numeric characters should be put in a numerical column. For example, do not write 10% or 134mmHg.
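A quick way to find offending cells before the analysis is to try converting each value to a number. The sketch below is a minimal check under a few assumptions: the column has been read as a list of strings, empty cells are accepted as missing data, and the reported row numbers assume the header sits on row 1. The function name is hypothetical.

```python
def non_numeric_cells(values):
    """Return (spreadsheet row, content) for cells that are not plain numbers."""
    problems = []
    for row_number, value in enumerate(values, start=2):  # row 1 is the header
        text = str(value).strip()
        if text == "":
            continue  # empty cell: missing data, allowed
        try:
            float(text)
        except ValueError:
            problems.append((row_number, text))
    return problems


print(non_numeric_cells(["10%", "134mmHg", "120", "", "12.5"]))
```

Note that Python's `float()` uses a decimal point, so a value typed with a decimal comma would also be reported.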
8- Keep the same spelling for the classes of categorical variables
Your biostatistics software may be very intelligent, but it will identify MI and myocardial infarction as two different classes.
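Counting the distinct spellings in a column makes such duplicates easy to spot. The sketch below prints the count of each class and flags spellings that differ only by capitalization or surrounding spaces. It cannot know that MI and myocardial infarction mean the same thing, so the counts still need to be read by a human. The function name is hypothetical.

```python
from collections import Counter


def spelling_variants(values):
    """Group spellings that differ only by case or surrounding spaces."""
    groups = {}
    for value in values:
        if value.strip() == "":
            continue  # empty cells are missing data, not a class
        groups.setdefault(value.strip().lower(), set()).add(value)
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}


column = ["MI", "mi ", "Stroke", "myocardial infarction", "MI"]
print(Counter(column))            # every class as the software would see it
print(spelling_variants(column))  # classes that are probably the same
```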
9- Leave the cells empty for missing data
Do not use other values such as “?”, “.”, NA, etc.
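If placeholders have already been typed, they can be blanked out in a copy of the data. The sketch below only handles the three placeholders named above; any other marker used in your file would have to be added to the (hypothetical) `placeholders` argument.

```python
def blank_missing(values, placeholders=("?", ".", "NA")):
    """Replace missing-data placeholders by empty cells; keep other values as they are."""
    return ["" if str(v).strip() in placeholders else v for v in values]


print(blank_missing(["12", "?", "NA", ".", "7"]))
```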
10- Do not use any personal data
Ideally, you should use an anonymization number for each of your patients. This anonymization number can simply be a sequential number of inclusion in the study. To find the patient who matches this anonymization number, you can use another Excel file or write on plain paper (but be careful not to lose it) with two columns: the patient’s name and the anonymization number. This file must be kept safe in the department from which the patients are extracted and is not allowed to be moved or copied.
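The separation between the study file and the correspondence file can be sketched as follows, assuming the patients are listed in order of inclusion and that a hypothetical `Name` column holds the personal data. The column and function names are illustrative only.

```python
def anonymize(rows, name_column="Name", number_column="Anonymization number"):
    """Split rows into a study table (no names) and a key table (name + number)."""
    study_rows, key_rows = [], []
    for number, row in enumerate(rows, start=1):  # sequential inclusion number
        key_rows.append({name_column: row[name_column], number_column: number})
        study_row = {number_column: number}
        study_row.update({k: v for k, v in row.items() if k != name_column})
        study_rows.append(study_row)
    return study_rows, key_rows


study, key = anonymize([
    {"Name": "Patient A", "Age": "54"},
    {"Name": "Patient B", "Age": "61"},
])
print(study)  # can be used for the analysis
print(key)    # must stay in the department, in a separate file
```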
After the entry: data management
The biostatistics software pvalue.io is not designed as a data management tool. This means that it cannot perform an operation between several columns (for example, compute a score).
It is therefore necessary to first duplicate your file (once the entry is complete, you should no longer touch the raw file). You will then be able to create new variables in the copy, which can be processed by the software.
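As an example of a new variable, the sketch below adds a score computed as the sum of several item columns, working on rows read from the duplicated file. The item names, the score name and the rule "leave the score empty if any item is missing" are all assumptions made for the illustration; a real score should follow its published definition.

```python
def add_score(rows, item_columns, score_name="Score"):
    """Return copies of the rows with a new column holding the sum of the items."""
    scored = []
    for row in rows:
        items = [str(row.get(column, "")).strip() for column in item_columns]
        new_row = dict(row)  # the input rows are left untouched
        if all(items):
            new_row[score_name] = sum(float(item) for item in items)
        else:
            new_row[score_name] = ""  # a missing item gives a missing score
        scored.append(new_row)
    return scored


rows = [
    {"Item 1": "2", "Item 2": "3"},
    {"Item 1": "1", "Item 2": ""},
]
print(add_score(rows, ["Item 1", "Item 2"]))
```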
Before uploading your data file to the software, you will need to delete the anonymization number and hospital name, and replace the birth dates with the ages.
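This last step can be sketched as below. The column names (`Anonymization number`, `Hospital`, `Birth date`), the ISO date format (YYYY-MM-DD) and the use of a single reference date to compute the age are assumptions; in a real study the age is often computed at each patient's inclusion date.

```python
from datetime import date


def age_in_years(birth, on):
    """Age in completed years on the date `on`."""
    years = on.year - birth.year
    if (on.month, on.day) < (birth.month, birth.day):
        years -= 1  # birthday not yet reached that year
    return years


def prepare_for_upload(row, reference_date):
    """Drop identifying columns and replace the birth date by the age."""
    removed = ("Anonymization number", "Hospital", "Birth date")
    cleaned = {k: v for k, v in row.items() if k not in removed}
    birth_text = str(row.get("Birth date", "")).strip()
    if birth_text:
        cleaned["Age"] = age_in_years(date.fromisoformat(birth_text), reference_date)
    else:
        cleaned["Age"] = ""  # unknown birth date: leave the age empty
    return cleaned


row = {"Anonymization number": "1", "Hospital": "X", "Birth date": "1980-07-16", "Sex": "Female"}
print(prepare_for_upload(row, date(2020, 7, 15)))
```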