Vital Statistics You Never Learned…Because They’re Never Taught
By Kevin Gray and Frank Harrell
KG: Starting from the beginning, what is statistics and how did it come about? Could you give us a short definition and history of the discipline?
FH: That’s a loaded question and best answered by referring readers to the many writings of one of the greatest of our statistician historians, Stephen Stigler (see for example https://en.wikipedia.org/wiki/History_of_statistics). In a nutshell, statistics began as a way to understand the workings of states, productivity, life expectancy, agricultural yields, etc., and to make estimates of things from samples (a statistical example of the latter dates back to the 5th century BCE in Athens). Roughly speaking, statistics has developed into a few broad areas: descriptive (e.g., the usual baseball statistics), inferential (e.g., do baseball hitters have different success probabilities when playing at home?), estimative (e.g., from a factorial experiment what is the effect of changing baking temperature if we hold the amounts of flour and sugar constant?) and predictive (e.g., financial forecasting or predicting how long a patient will go until a disease recurs).
Concerning a definition for statistics, it is a field that is a science unto itself and that benefits all other fields and everyday life. What is unique about statistics is its proven tools for decision making in the face of uncertainty, understanding sources of variation and bias, and most importantly, statistical thinking. Statistical thinking is a different way of thinking that is part detective work, part skepticism, and involves considering alternate takes on a problem. Statistics involves measurement refinement, experimental design, data analysis, inference, and interpretation of trends and evidence.
KG: What are the most fundamental things decision makers need to know about statistics in order to use it effectively for decision making?
FH: Always the most important issues are understanding the meaning and reliability of measurements and understanding the linkage between interpretation of the data and the experimental design. As data have gotten more voluminous, the average data analyst has gotten more relaxed about design, and we are seeing many data interpretation fiascos as a result (see https://youtu.be/TGGGDpb04Yc for a great example). When there is no design (as in casual data collection) or the design used is not consistent with the project goals (prospective vs. retrospective designs; randomized vs. observational designs, etc.), it is very seldom that a statistical analysis can come to the rescue. A famous quote by one of the founders of modern statistics, R. A. Fisher, nicely summarizes this issue: “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”
Regarding measurements, I see a lot of statisticians forgetting the adage “question everything” and trusting the client’s selection or computation of measurements. For example, nice continuous measurements are often categorized, resulting in great loss of information, power, precision, and generalizability. Or an investigator may derive a response variable using a normalization procedure that would be better modeled than used to create a ratio. Given a good design and appropriate measurements, the approach to statistical analysis needs to be based on good statistical principles as I attempted to summarize here. Then the result needs to be actionable by estimating things on scales that are useful to the client (e.g., relative treatment effects, predicted risks, life expectancy, Bayesian posterior probabilities). A very common problem with the rise of machine learning (see below) is improper use of classification as opposed to prediction; classification makes too many assumptions for the client and does not provide a gray zone. I discuss this in detail here.
KG: What are the main areas or branches within statistics? How do they differ?
FH: There are three schools of thought: The most commonly used is frequentist statistics, which involves estimation, significance tests, confidence limits, and hypothesis testing in a scenario of imagined repetitions of a study (sampling distributions). Because of the need to consider sampling distributions and the “sample space,” frequentist methods can get quite complicated and require customized solutions for each sampling scheme, e.g., when one does sequential testing and wishes to stop when there is sufficient evidence for an effect. Statistical statements of frequentist results have been shown to be very difficult for non-statisticians (and some statisticians) to interpret.
Next comes the Bayesian school of statistics, which actually preceded the frequentist school by more than a century due to the work of Bayes and Laplace. It was not used very much until powerful computers became available to statisticians. The Bayesian approach requires one to specify an anchor/starting point (“prior distribution”) which can require much thought but can also just specify a degree of skepticism to apply to the data. The benefits of going through this step are great – no need to create one-off solutions to complex design/sampling schemes, and Bayes provides directly actionable probabilities – probabilities that effects are positive, for example, as opposed to frequentist p-values, which are probabilities of making assertions about effects being positive when in fact they are zero.
Finally, there is the likelihood school, which is like the Bayesian school without prior distributions. Likelihood methods, like Bayesian ones, avoid the sample space and so are much cleaner, but they mainly provide relative and not absolute evidence and don’t handle models containing a very large number of parameters. Besides the three schools there are different tools within each school, especially within frequentism – such as the bootstrap, nonparametric methods, and missing data imputation methods.
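The contrast between the two kinds of probability can be made concrete with a small calculation. What follows is a minimal Python sketch, not a method taken from the interview: it assumes a normal prior and a normally distributed effect estimate (a conjugate normal model), and every number in it is hypothetical.

```python
from statistics import NormalDist

# Hypothetical inputs (assumptions for illustration, not from the interview)
estimate = 0.5     # observed treatment effect
std_error = 0.25   # standard error of that estimate
prior_mean = 0.0   # skeptical prior centered on no effect
prior_sd = 0.5     # how strongly the prior doubts large effects

# Conjugate normal update: precisions (1 / variance) add
prior_precision = 1.0 / prior_sd**2
data_precision = 1.0 / std_error**2
post_precision = prior_precision + data_precision
post_mean = (prior_mean * prior_precision + estimate * data_precision) / post_precision
post_sd = post_precision ** -0.5

# Bayesian output: probability that the effect is positive, given data and prior
prob_positive = 1.0 - NormalDist(post_mean, post_sd).cdf(0.0)

# Frequentist output: two-sided p-value computed under the assumption of zero effect
z = estimate / std_error
p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))

print(f"posterior mean = {post_mean:.3f}, P(effect > 0) = {prob_positive:.3f}")
print(f"two-sided p-value = {p_value:.4f}")
```

The first number is a direct statement about the effect given the data and the chosen prior. The second is a statement about the data under the assumption that the effect is zero. A different prior would give a different posterior probability, which is why the choice of prior deserves thought.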
KG: Are machine learning and data science different from statistics in your view?
FH: Yes. To oversimplify things, I would say that data science is applied statistics + computer science, with less attention to statistical theory and hypothesis testing, in favor of estimation and prediction. Machine learning is an extremely empirical way of doing statistical modeling, without caring very much about being able to separate effects of variables. Many machine learning practitioners are well grounded in statistics but many are not. The latter group seems to be constantly reinventing the wheel and using approaches that statistics showed decades ago don’t work.
A hallmark of a good statistician is knowing how to quantify accuracy of estimates and predictions. The latter group of machine learning practitioners have never learned the principles and theory behind measures of predictive accuracy (including proper probability accuracy scores) and are constantly developing “classifiers” when predictions or optimal Bayes decisions were needed for the problem. These classifiers have a host of problems including failing to generalize to new samples with much different outcome frequencies, as discussed in more detail here. Machine learning practitioners also seem obsessed with feature selection and don’t realize that torturing data to attempt in vain to determine the “important” predictors is at odds with getting maximum information out of all the predictors, the latter having a lot to do with maximizing predictive discrimination.
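One way to see why a proper probability accuracy score carries more information than classification accuracy is to compare the two on the same predictions. The sketch below is illustrative only: the outcomes and the two sets of predicted probabilities are made up, the 0.5 cutoff is an arbitrary assumption, and the Brier score (mean squared difference between predicted probability and outcome) stands in as one example of a proper score.

```python
# Hypothetical binary outcomes and predicted probabilities from two models
outcomes = [1, 1, 1, 0, 0, 0, 1, 0]
model_a = [0.90, 0.80, 0.90, 0.10, 0.20, 0.10, 0.40, 0.60]  # confident predictions
model_b = [0.55, 0.51, 0.60, 0.45, 0.49, 0.40, 0.45, 0.55]  # barely off 0.5

def accuracy(probs, ys, cutoff=0.5):
    """Proportion classified correctly after forcing a yes/no call at the cutoff."""
    hits = sum((p >= cutoff) == bool(y) for p, y in zip(probs, ys))
    return hits / len(ys)

def brier_score(probs, ys):
    """Mean squared error of the predicted probabilities (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, ys)) / len(ys)

accuracy_a = accuracy(model_a, outcomes)
accuracy_b = accuracy(model_b, outcomes)
brier_a = brier_score(model_a, outcomes)
brier_b = brier_score(model_b, outcomes)

print(f"accuracy: A = {accuracy_a:.2f}, B = {accuracy_b:.2f}")
print(f"Brier:    A = {brier_a:.3f}, B = {brier_b:.3f}")
```

Both hypothetical models get the same proportion of cases right after thresholding, yet their Brier scores differ because one set of probabilities is far more informative than the other. Thresholding discards exactly that difference.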
KG: You question many common statistical practices and are frequently quite vocal in your criticisms. What are the most important things practitioners get wrong?
FH: First, I start with arithmetic. It is amazing how many people don’t know that you don’t add ratios unless they represent proportions for mutually exclusive events. In general ratios multiply. I see papers all the time that either analyzed ratios without taking the log or that analyzed percent change from baseline, failing to note that the math doesn’t work. Take for example a subject who starts at a value of 1.0 and increases to 2.0. This is a 100% increase. Then consider a subject starting at 2.0 who decreases to 1.0. This is a 50% decrease. The average of 100% and -50% is +25% whereas the two should cancel, arriving at an average of 0%. Percent change is an asymmetric measure and can’t be used in statistical analysis except under special restrictions.
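The two-subject example is easy to verify. Here is a minimal Python sketch using only the numbers given above; the function names are illustrative, and the log ratio is shown as one symmetric alternative, consistent with the remark about taking logs.

```python
import math

# The two subjects from the example: (baseline, follow-up)
subjects = [(1.0, 2.0), (2.0, 1.0)]

def percent_change(baseline, follow_up):
    """Percent change from baseline (asymmetric in gains and losses)."""
    return 100.0 * (follow_up - baseline) / baseline

def log_ratio(baseline, follow_up):
    """Natural log of follow-up / baseline (symmetric around zero)."""
    return math.log(follow_up / baseline)

pct = [percent_change(b, f) for b, f in subjects]   # +100% and -50%
logs = [log_ratio(b, f) for b, f in subjects]       # +log(2) and -log(2)

mean_pct = sum(pct) / len(pct)              # the changes do not cancel
mean_log = sum(logs) / len(logs)            # the changes cancel
geometric_mean_ratio = math.exp(mean_log)   # back-transformed average ratio

print(f"mean percent change = {mean_pct:+.1f}%")
print(f"mean log ratio = {mean_log:.3f}, geometric mean ratio = {geometric_mean_ratio:.3f}")
```

On the log scale a doubling and a halving are equal and opposite, so their average is zero and the back-transformed average ratio is 1, which is the "no net change" answer the percent-change average fails to give.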
Regarding the improper addition of ratios, many medical papers add odds ratios or hazard ratios when they should have added the logs of these. A simple example shows why. When developing a risk score suppose that two risk factors have regression coefficients of 1 and -1 in a logistic regression model. The two odds ratios are 2.72 and 0.37. Adding these pretends that both risk factors are harmful when in fact the second risk factor is protective. Change from baseline has a host of other problems as described in my blog. Instead we should be analyzing the raw response variable as a dependent variable, covariate-adjusting for the raw baseline variable. Statisticians and other data analysts need to carefully critique the math being used by their collaborators!
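The risk-score example can be checked directly. This short Python sketch uses the two regression coefficients from the text; the variable names are illustrative.

```python
import math

# Logistic regression coefficients (log odds ratios) from the example
coefficients = [1.0, -1.0]

# Exponentiating gives the odds ratios: about 2.72 and 0.37
odds_ratios = [math.exp(b) for b in coefficients]

# Improper: adding odds ratios treats both factors as raising risk
naive_sum = sum(odds_ratios)

# Proper: add on the log scale, then exponentiate if an odds ratio is wanted
log_sum = sum(coefficients)
combined_odds_ratio = math.exp(log_sum)

print(f"odds ratios: {odds_ratios[0]:.2f} and {odds_ratios[1]:.2f}")
print(f"sum of odds ratios = {naive_sum:.2f}")
print(f"sum of logs = {log_sum:.1f}, combined odds ratio = {combined_odds_ratio:.1f}")
```

For a subject with both risk factors, the harmful and protective effects offset exactly on the log scale, giving a combined odds ratio of 1. The sum of the two odds ratios is about 3.09, a number with no such interpretation.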
Turning to statistical models there are many common pitfalls, including trying to learn too much (using feature selection or estimating too many parameters) for what the sample size allows, resulting in overfitting/overinterpretation; making assumptions such as linearity that are unlikely to be true; trying lots of transformations and pretending the final transformation was pre-specified, destroying statistical inferential properties of the result (as opposed to using spline functions); using improper accuracy scores; using classification instead of prediction in non-large signal:noise situations; trying different transformations on the dependent variable or being affected by outliers in that variable, as opposed to using robust semiparametric ordinal regression models. Then there is stepwise regression – don’t get me started …
Dichotomania is one of the greatest crimes against data. This is information-losing, arbitrary, and assumes discontinuous relationships not found in nature. It is virtually never a good idea to categorize a continuous dependent or independent variable.
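The information loss from dichotomizing can be seen in a small simulation. The sketch below is an illustration under stated assumptions, not an analysis from the interview: it simulates two normally distributed variables with a true correlation of 0.5, splits one of them at its median, and compares the correlations. The sample size, seed, and correlation are arbitrary choices.

```python
import math
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)   # fixed seed so the run is reproducible
n = 5000                 # hypothetical sample size
true_r = 0.5             # hypothetical true correlation

# Simulate a continuous predictor x and a response y correlated with it
x = [rng.gauss(0, 1) for _ in range(n)]
y = [true_r * xi + math.sqrt(1 - true_r**2) * rng.gauss(0, 1) for xi in x]

# Dichotomize the predictor at its median ("high" vs "low")
cut = statistics.median(x)
x_binary = [1.0 if xi > cut else 0.0 for xi in x]

r_continuous = pearson(x, y)
r_dichotomized = pearson(x_binary, y)

print(f"correlation using continuous x:   {r_continuous:.3f}")
print(f"correlation using dichotomized x: {r_dichotomized:.3f}")
```

For a normally distributed predictor, a median split is expected to shrink the correlation by a factor of about 0.80, so roughly a third of the explained variation is thrown away before any modeling begins. Cutpoints away from the median generally lose more.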
There are so many other problems we see every day, including the use of ineffective graphics such as pie and bar charts.
KG: Finally, statistics is evolving very rapidly and new methods are continuously being developed. What do you think statistics will look like in 10-15 years?
FH: Whew – another difficult question! I am certain we will see Bayesian models used much more frequently as they give us the outputs we really need (forward-time, forward-information-flow probabilities) and allow us to formally incorporate external information, even if that information is, for example, just that it is impossible for a certain treatment to have an odds ratio greater than 10 against an outcome. We will also see more interpretable, flexible, and robust predictive methods, more intuitive and powerful statistical software and graphics, and in general more statistical methods that do not assume normality or rely on large sample theory.
Thank you, Frank!
Kevin Gray is president of Cannon Gray, a marketing science and analytics consultancy.
Frank Harrell is Professor of Biostatistics and Founding Chair of the Department of Biostatistics at the Vanderbilt University School of Medicine. He also works as an expert statistical advisor in the Office of Biostatistics, Center for Drug Evaluation and Research, FDA. He is the author of numerous publications, the influential book Regression Modeling Strategies and the R packages rms and Hmisc. He can be followed on his blog Statistical Thinking.