Automatic test selection

This page describes the methodological choices behind the tests performed on pvalue.io. It is technical and intended for users wondering why one test is performed rather than another. The corresponding R code is available here.

Univariable test

The test performed depends on the types of the response variable Y (the variable to explain) and of the explanatory variable X.

Unpaired tests (independent measures)

Y is categorical with 2 classes

  • If the variable X is numerical
    • If the distribution of X is normal* in both classes, or the sample size is over 30 in both classes: Welch’s t-test. Welch’s t-test is used because it is more robust to unequal variances than Student’s t-test, with almost the same power[1].
    • Otherwise: Mann-Whitney’s non-parametric test
  • If the variable X is categorical
    • If the theoretical (expected) count in each cell of the contingency table is greater than 5: chi-square test
    • Otherwise: Fisher’s test
      • Exact test if X contains 2 classes
      • Otherwise: Fisher’s test whose p-value is obtained by Monte-Carlo simulation with 100 000 iterations
  • For survival analyses: log-rank test
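The rules above for a two-class Y can be read as a small decision function. The following is only a sketch of that logic, not the pvalue.io implementation: the function name, its parameters and the returned labels are hypothetical, and the caller is assumed to have already computed the normality flag, the smallest class size and the smallest theoretical count.

```python
def select_test_binary_y(x_type, *, x_normal_in_both=False, min_class_size=0,
                         min_expected_count=0.0, x_levels=2, survival=False):
    """Return a label for the test chosen when Y has 2 classes (hypothetical helper)."""
    if survival:
        return "log-rank"
    if x_type == "numerical":
        # Normal in both classes OR more than 30 observations in both classes
        if x_normal_in_both or min_class_size > 30:
            return "welch-t"
        return "mann-whitney"
    if x_type == "categorical":
        # Every theoretical cell count must be greater than 5
        if min_expected_count > 5:
            return "chi-square"
        # Exact Fisher test only when X also has 2 classes (2x2 table)
        return "fisher-exact" if x_levels == 2 else "fisher-monte-carlo"
    raise ValueError(f"unknown variable type: {x_type!r}")


print(select_test_binary_y("numerical", min_class_size=45))
print(select_test_binary_y("categorical", min_expected_count=3.2, x_levels=4))
```

Note that "over 30" and "greater than 5" are treated as strict inequalities here, which is an interpretation of the wording above.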

Y is categorical with over 2 classes

  • If the variable X is numerical
    • If the distribution of X is normal* in all classes or the sample size is over 30 in all classes, and the variances are homogeneous (homoscedasticity): ANOVA
    • Otherwise: Kruskal-Wallis’ non-parametric test
  • If the variable X is categorical
    • If the theoretical (expected) count in each cell of the contingency table is greater than 5: chi-square test
    • Otherwise: Fisher’s test whose p-value is obtained by Monte-Carlo simulation with 100 000 iterations
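The theoretical count of a cell is its row total multiplied by its column total, divided by the grand total. As an illustration of the rule for two categorical variables, here is a minimal sketch that computes these counts and picks a test accordingly. The function names and labels are hypothetical, and the table is assumed to be a list of rows of observed counts with no empty row or column.

```python
def expected_counts(table):
    """Theoretical counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def select_categorical_test(table):
    """Choose between chi-square and Fisher's test from the theoretical counts."""
    smallest = min(min(row) for row in expected_counts(table))
    if smallest > 5:
        return "chi-square"
    # A 2x2 table allows the exact test; larger tables use the simulated p-value
    if len(table) == 2 and len(table[0]) == 2:
        return "fisher-exact"
    return "fisher-monte-carlo"


print(expected_counts([[10, 20], [30, 40]]))
print(select_categorical_test([[1, 2, 3], [3, 4, 1], [2, 2, 2]]))
```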

Y is numerical

  • If the variable X is numerical
    • If the distributions of Y and X are normal* or the sample size is over 30, and the variances are homogeneous (homoscedasticity): Pearson’s correlation coefficient
    • Otherwise: Spearman’s rank correlation coefficient (rho)
  • If the variable X is categorical with 2 classes
    • If the distribution of Y is normal* in both classes or the sample size is over 30 in both classes: Welch’s t-test
    • Otherwise: Mann-Whitney’s non-parametric test
  • If the variable X is categorical with over 2 classes
    • If the distribution of Y is normal* in all classes or the sample size is over 30 in all classes, and the variances are homogeneous (homoscedasticity): ANOVA
    • Otherwise: Kruskal-Wallis’ non-parametric test
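The case of a numerical Y can be sketched the same way. This is an illustrative reading of the rules above rather than the actual pvalue.io code: the function name, parameters and labels are hypothetical, and the condition is assumed to mean "(normal or more than 30 observations) and homogeneous variances" wherever homogeneity of variances is mentioned.

```python
def select_test_numerical_y(x_type, *, normal=False, min_group_size=0,
                            homoscedastic=False, x_levels=2):
    """Return a label for the test chosen when Y is numerical (hypothetical helper)."""
    normal_or_large = normal or min_group_size > 30
    if x_type == "numerical":
        # Correlation between two numerical variables
        return "pearson" if normal_or_large and homoscedastic else "spearman"
    if x_type == "categorical":
        if x_levels == 2:
            # Welch's test does not require equal variances
            return "welch-t" if normal_or_large else "mann-whitney"
        if x_levels > 2:
            return "anova" if normal_or_large and homoscedastic else "kruskal-wallis"
    raise ValueError(f"unsupported combination: {x_type!r}, {x_levels} classes")


print(select_test_numerical_y("categorical", min_group_size=50,
                              homoscedastic=False, x_levels=3))
```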

Paired tests (2 measures for the same patient)

X is categorical

  • McNemar’s test
    • If X contains 2 classes: McNemar’s exact test
    • If X contains over 2 classes: McNemar-Bowker’s test

X is numerical

  • If the distribution of X is normal* or the sample size is over 30 for both measures: paired Student’s t-test
  • Otherwise: Wilcoxon’s signed-rank test (the non-parametric paired counterpart of Mann-Whitney’s test)
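For paired measures, the selection reduces to the short sketch below. The function name, parameters and labels are hypothetical. The paired non-parametric test is labelled with its usual name, the Wilcoxon signed-rank test, and the sample size is assumed to be the number of pairs.

```python
def select_paired_test(x_type, *, normal_both=False, n_pairs=0, x_levels=2):
    """Return a label for the paired test (hypothetical helper)."""
    if x_type == "categorical":
        # Bowker's test generalises McNemar's test to more than 2 classes
        return "mcnemar-exact" if x_levels == 2 else "mcnemar-bowker"
    if x_type == "numerical":
        if normal_both or n_pairs > 30:
            return "paired-t-test"
        # Paired counterpart of the Mann-Whitney test
        return "wilcoxon-signed-rank"
    raise ValueError(f"unknown variable type: {x_type!r}")


print(select_paired_test("numerical", n_pairs=12))
```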

Multivariable statistical models

The choice of the multivariable model depends on the response variable Y:

  • If the variable Y is numerical: linear regression model
  • If the variable Y is categorical with two classes
    • For survival analysis: Cox model
    • Otherwise: logistic regression model
  • If the variable Y is categorical with over two classes: no analysis is possible
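The model choice above can be summarised in a few lines. This is a sketch with hypothetical names and labels; it returns None when no multivariable analysis is available.

```python
from typing import Optional


def select_multivariable_model(y_type: str, y_levels: int = 2,
                               survival: bool = False) -> Optional[str]:
    """Return a label for the multivariable model, or None if no analysis is possible."""
    if y_type == "numerical":
        return "linear-regression"
    if y_type == "categorical":
        if y_levels == 2:
            return "cox" if survival else "logistic-regression"
        # More than two classes: no multivariable analysis
        return None
    raise ValueError(f"unknown variable type: {y_type!r}")


print(select_multivariable_model("categorical", y_levels=2, survival=True))
```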

* pvalue.io considers a distribution to be normal when 1) the mean of the distribution is between the 40th and 60th percentiles and 2) the skewness value is less than 0.6.
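As an illustration of this normality rule, here is a minimal sketch using only the Python standard library. Several details are assumptions, since the rule above does not specify them: the percentiles are computed with the default method of statistics.quantiles, the skewness is the moment-based coefficient, and its absolute value is compared with 0.6. The function names are hypothetical.

```python
import statistics


def moment_skewness(values):
    """Moment-based skewness m3 / m2**1.5 (one possible estimator, assumed here)."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    # A constant sample has no spread; its skewness is treated as 0 in this sketch
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def looks_normal(values, max_skewness=0.6):
    """Apply the two criteria: mean within the 40th-60th percentiles, small skewness."""
    deciles = statistics.quantiles(values, n=10)  # 9 cut points: 10th ... 90th
    p40, p60 = deciles[3], deciles[5]
    mean_is_central = p40 <= statistics.fmean(values) <= p60
    # Assumption: the threshold applies to the absolute value of the skewness
    return mean_is_central and abs(moment_skewness(values)) < max_skewness


print(looks_normal(list(range(1, 22))))
print(looks_normal([1] * 15 + [2, 3, 50, 100, 200]))
```

The first call uses a symmetric sample and the second a strongly right-skewed one whose mean lies far above the 60th percentile.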

 

[1] Welch, B. L. The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika 34, 28–35 (1947).
