Chapter 1: The Golem of Prague
1.1: Statistical Golems
- Scientists frequently construct and use “statistical golems” - tools that know their own procedure, but have no inherent wisdom.
- Examples: p-values, statistical significance tests, etc.
- Classical statistical tools may not be adaptable to all modern research scenarios.
- The golems are often associative rather than causal, leading to misuse.
1.2: Statistical Rethinking
- The classic approach is to use a “flow-chart” to determine the statistical procedure to use. However, this can lead to battles over picking the “correct” test.
- We also face epistemological issues, notably the conflation of statistical testing with hypothesis falsification.
- Karl Popper’s The Myth of the Framework (1994), published in the year of his death, covers this topic.
- McElreath posits that “deductive falsification is impossible”, given that:
- Hypotheses are not models. So, falsifying the hypothesis does not falsify the model.
- Measurement error can invalidate falsification. If the measurement is wrong, an apparent falsification may be spurious.
- Note that null hypothesis significance testing (NHST) attempts to falsify the null hypothesis, not the actual research hypothesis.
1.2.1: Hypotheses are not models
Hypotheses are not models; rather they are one of three pieces that constitute scientific understanding:
- Hypotheses are falsifiable, often vague descriptions of evidence.
- Process models are non-statistical descriptions of mechanisms that support or refute a hypothesis.
- Statistical models are statistical understandings of process models.
These pieces have special properties:
- Since hypotheses are vague, each maps to multiple process models.
- Similarly, each statistical model can map to multiple process models.
- Therefore, any statistical model can support/refute multiple hypotheses.
This conundrum implies that when two process models imply similar data, you should search for a description of the data under which the processes look different: two process models can make similar predictions for one observable, X, but vastly different predictions for another, Y.
1.2.2: Measurement matters
Two properties of statistical modeling complicate hypothesis falsification:
- Observation error
- Continuous hypotheses
1.2.2.1: Observation error
Difficulty of observation often leads to observed data mapping to multiple hypothetical conclusions. Or, the data can simply be measured incorrectly. Both instances can result in “spurious falsifications,” where a hypothesis is rejected because of observation error or chance rather than genuine disconfirming evidence.
1.2.2.2: Continuous hypotheses
Continuous hypotheses, such as “80% of swans are white,” are difficult to falsify, since modus tollens (“If P, then Q. Not Q. So, not P.”) does not apply: the hypothesis does not predict any single observation with certainty.
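A small sketch of why modus tollens fails here (the swan numbers are hypothetical): under “80% of swans are white,” observing non-white swans is not disconfirming evidence — it is expected in almost every sample, so no single sighting plays the role of “Not Q.”

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success prob p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothesis H: 80% of swans are white.
n = 20          # swans observed (hypothetical sample size)
p_white = 0.8

# Probability of seeing at least one non-white swan if H is TRUE:
p_some_nonwhite = 1 - binom_pmf(n, n, p_white)
print(f"P(at least one non-white swan | H) = {p_some_nonwhite:.3f}")
```

Since the probability is near 1, spotting non-white swans cannot logically falsify the continuous hypothesis; only the observed *proportion*, evaluated probabilistically, carries evidence.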
1.2.3: Falsification is consensual
In the scientific community, falsification of hypotheses is not strictly logical; rather, agreements on evidence are reached via consensus. Consequently, claims of science’s definitiveness are usually exaggerated and possibly societally harmful. Kitcher (2011), Science in a Democratic Society, is a good intro to the sociology of science.
1.3: Tools for golem engineering
Scientific research is often broader in scope than testing alone; models fill this gap. In addition to serving testing, models can make predictions and communicate understandings.
Rethinking covers four tools for modeling:
- Bayesian data analysis
- Model comparison
- Multilevel models
- Graphical causal models
1.3.1: Bayesian data analysis
We just use randomness to describe our uncertainty…
Bayesian data analysis is a highly generalizable probability-based tool for using data to learn about the world.
Bayesian analysis is intuitive:
- Count* the number of ways each candidate explanation could produce the observed data.
- Rank the explanations by their counts; explanations with more ways to produce the data are more plausible.
*In Bayesian analysis, “counting” is carried out with calculus, since the practice works with probability distributions rather than discrete tallies.
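A minimal sketch of this counting logic, in the spirit of the book’s “garden of forking data” (the bag-of-marbles setup here is a hypothetical toy, not the book’s exact numbers): a bag holds 4 marbles, each blue or white; we draw with replacement and see blue, white, blue; each conjectured composition is scored by the number of ways it could produce that sequence.

```python
# Observed draws (with replacement) from a bag of 4 marbles:
observed = ["blue", "white", "blue"]

# Conjectures: the bag contains 0, 1, 2, 3, or 4 blue marbles.
conjectures = {blues: {"blue": blues, "white": 4 - blues} for blues in range(5)}

ways = {}
for blues, counts in conjectures.items():
    n = 1
    for draw in observed:
        n *= counts[draw]          # ways this conjecture can produce this draw
    ways[blues] = n

# Normalizing the counts yields plausibilities (a posterior, by counting).
total = sum(ways.values())
plausibility = {b: w / total for b, w in ways.items()}
for b in sorted(plausibility):
    print(f"{b} blue marbles: {ways[b]:2d} ways, plausibility {plausibility[b]:.2f}")
```

The conjecture with the most ways to produce the data (3 blue marbles) is the most plausible, and conjectures that cannot produce the data at all (0 or 4 blue) get zero plausibility.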
The frequentist approach, in contrast, imagines an infinite resampling of the data to arrive at a probability distribution. Ironically, however, frequentist results are often interpreted in Bayesian terms (e.g., “the value is 95% likely to fall inside this confidence interval”).
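A sketch of what the frequentist “imagined resampling” means in practice (all numbers here are hypothetical): a 95% confidence interval is a claim about the *procedure* under repeated sampling — about 95% of intervals constructed this way cover the true parameter — not a probability statement about the parameter itself.

```python
import random

random.seed(1)
true_mean, true_sd, n, trials = 10.0, 2.0, 50, 2000

covered = 0
for _ in range(trials):
    # One imagined resample of the experiment.
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    m = sum(sample) / n
    var = sum((x - m) ** 2 for x in sample) / (n - 1)
    half = 1.96 * (var ** 0.5) / n ** 0.5     # normal-approximation 95% CI
    if m - half <= true_mean <= m + half:
        covered += 1

coverage = covered / trials
print(f"empirical coverage ≈ {coverage:.3f}")  # close to 0.95 by construction
```

Any single interval either contains the true mean or it does not; the 95% lives in the long-run behavior of the procedure, which is exactly the interpretation the quoted misreading skips over.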
Further, in the frequentist approach, parameters are not random variables; randomness belongs to the (imagined, repeated) measurement process, which makes sense when the measurement has a well-defined sampling distribution. Bayesian analysis applies more broadly: in image analysis, for example, it allows noisy images to be reconstructed using modeled probabilities, with no repeated sampling involved.
1.3.2: Model comparison and prediction
Fitting is easy; prediction is hard.
Cross-validation and information criteria are two metrics used to estimate predictive accuracy and detect overfitting:
- Cross-validation provides fit estimates over different resamples of the data.
- Information criteria are measures that “balance model fit with model complexity”.
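A hypothetical sketch of cross-validation in the leave-one-out flavor (synthetic data, not an example from the book): each point is predicted from a model fit on all the *other* points, so the score reflects out-of-sample prediction rather than in-sample fit.

```python
import random

random.seed(2)
# Synthetic data with a real linear trend plus noise.
xs = [i / 10 for i in range(30)]
ys = [1.0 + 0.8 * x + random.gauss(0, 0.3) for x in xs]

def fit_line(xs, ys):
    """Least-squares intercept and slope for simple regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def loo_error(xs, ys, predict_fn):
    """Mean squared error, each point predicted from a fit on the others."""
    err = 0.0
    for i in range(len(xs)):
        tx, ty = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        err += (ys[i] - predict_fn(tx, ty, xs[i])) ** 2
    return err / len(xs)

def mean_model(tx, ty, x):
    return sum(ty) / len(ty)           # ignores x entirely

def line_model(tx, ty, x):
    a, b = fit_line(tx, ty)
    return a + b * x

err_mean = loo_error(xs, ys, mean_model)
err_line = loo_error(xs, ys, line_model)
print(f"LOO MSE  mean-only: {err_mean:.3f}  linear: {err_line:.3f}")
```

Here the linear model earns a lower leave-one-out error because the trend is real; on pure-noise data, the same procedure would penalize the extra flexibility instead.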
Recall that multiple process models can exist for a statistical model. Thus, it is always necessary to compare multiple statistical models for a given problem.
1.3.3: Multilevel models
Multilevel models are hierarchical Bayesian models that utilize partial pooling, a statistical technique that uses information across levels of the hierarchy to produce better estimates for units within a level.
Four common applications of partial pooling:
- Adjust for repeat sampling (multiple observations from the same unit)
- Adjust for imbalanced sampling (observation X sampled 10x more than Y)
- Study variation (useful when question includes variations among a level within the hierarchy)
- Avoid averaging (features are often pre-averaged for traditional models, destroying the data’s variation)
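An illustrative sketch of what partial pooling does to estimates (the groups, the shrinkage weight n/(n + k), and the constant k are all hypothetical — in a real multilevel model the degree of pooling is learned from the data): each group’s mean is pulled toward the grand mean, and sparsely sampled groups are pulled hardest.

```python
groups = {
    "A": [4.1, 3.9, 4.0, 4.2, 3.8, 4.0, 4.1, 3.9],  # well-sampled group
    "B": [6.5],                                      # a single, noisy observation
}
all_obs = [y for ys in groups.values() for y in ys]
grand_mean = sum(all_obs) / len(all_obs)

k = 5.0  # assumed shrinkage constant (hypothetical; learned in a real model)
pooled = {}
for name, ys in groups.items():
    n = len(ys)
    group_mean = sum(ys) / n
    w = n / (n + k)                       # more data -> trust the group more
    pooled[name] = w * group_mean + (1 - w) * grand_mean
    print(f"{name}: raw {group_mean:.2f} -> pooled {pooled[name]:.2f}")
```

Group A, with eight observations, barely moves, while group B’s lone extreme observation is shrunk strongly toward the grand mean — this is the sense in which information is shared across units within a level.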
McElreath argues that “multilevel regression deserves to be the default form of regression,” and that research involving single-level models must justify the decision to not use a multilevel model. Particularly, the researcher must demonstrate lack of variation in treatment effects among observations.
Multilevel models are not new! From 1960 to 1978, John Tukey developed multilevel models for NBC that were used for election forecasting.
1.3.4: Graphical causal models
A statistical model is an amazing association engine.
Causally flawed models and confounded relationships can often yield better predictions, because association is powerful for prediction. This means scientists can be systematically misled by predictive accuracy alone.
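A hypothetical simulation of how association misleads: a confounder Z causes both X and Y, while X has no causal effect on Y at all. X still predicts Y well, so a model using X would score well on predictive accuracy while telling us nothing about intervening on X.

```python
import random

random.seed(3)
n = 10_000
z = [random.gauss(0, 1) for _ in range(n)]        # confounder
x = [zi + random.gauss(0, 1) for zi in z]         # Z -> X
y = [zi + random.gauss(0, 1) for zi in z]         # Z -> Y (X plays no role)

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

print(f"corr(X, Y) = {corr(x, y):.2f}")  # substantial, despite no X -> Y effect
```

The correlation is real and useful for prediction, but an intervention that changes X would leave Y untouched — a distinction only a causal model, not predictive accuracy, can reveal.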
Thus, being mindful of the underlying causal assumptions is a mandatory component of statistical modeling. This is most often achieved through construction of a causal model, from which multiple statistical models can be developed.
Graphical causal models, such as directed acyclic graphs (DAGs), are helpful tools allowing for the visualization of causal relationships. DAGs are developed outside of the data, usually in deep consultation with domain experts.
Today’s dominant method of causal inference is the “causal salad,” wherein control variables are tossed into models in the hope that a causal narrative falls out; McElreath warns against this practice. When designing a model to be later used for interventions, it is important that the causal relationships, especially the effects of the controls, are explainable.
1.4: Summary
McElreath makes the argument that instead of working with a plethora of black-box golems for testing null hypotheses, we should aim to develop the skills to build and critique our own models. The tools introduced – Bayesian data analysis, model comparison, multilevel modeling, and graphical causal models – serve this aim.
Chapter Outline
- Chapters 2-3: Intro to Bayesian inference
- Chapters 4-8: Multiple linear regression
- Chapters 9-12: Generalized linear models
- Chapters 13-16: Multilevel models
- Chapter 17: Address issues from 1st edition
Each chapter ends with exercises across difficulty levels. The more difficult problems introduce new material and challenges.