Why Your Stratified Split Still Produces Unrepresentative Test Sets
You ran train_test_split(..., stratify=y), your class distribution looks identical across train and test, and you're confident your evaluation is solid. Then the model ships and performance drops noticeably. The stratified split didn't lie to you; it just told you a very narrow slice of the truth.
Stratification only guarantees one thing: the proportion of each target class is preserved. Everything else (feature distributions, temporal order, group membership, rare subpopulations) is left to chance. That gap between what stratification promises and what evaluation actually requires is where most silent failures live.
What you'll learn
- Why class-ratio preservation is necessary but not sufficient for a representative test set
- How covariate shift, temporal leakage, and group leakage sneak past stratification
- How to detect distributional gaps between your train and test splits
- Practical alternatives and additions to standard stratified splitting
- What to do when your dataset is too small to fix this cleanly
Prerequisites
This article assumes you're comfortable with scikit-learn's splitting utilities and have a working knowledge of pandas. Code examples use Python 3.10+ and scikit-learn 1.3+. A basic understanding of cross-validation will help for the later sections.
The One Promise Stratification Actually Makes
Stratified splitting samples rows so that each split mirrors the original class balance. If 15% of your data is class 1, both your train and test sets will be approximately 15% class 1. That's the entire guarantee.
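To make that guarantee concrete, here is a minimal sketch on synthetic data. The 15% positive rate mirrors the example above; the `groups` array (think of a customer or patient ID) and all other names are hypothetical, introduced only for illustration. It checks two things: that the class rate matches across splits, and that the same groups nonetheless show up on both sides.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 1,000 rows, ~15% positives, 50 hypothetical groups.
rng = np.random.default_rng(0)
n_rows = 1000
X = rng.normal(size=(n_rows, 3))
y = (rng.random(n_rows) < 0.15).astype(int)
groups = rng.integers(0, 50, size=n_rows)  # e.g. a customer ID per row

# Stratify on the target only; groups are split alongside but not controlled.
X_train, X_test, y_train, y_test, g_train, g_test = train_test_split(
    X, y, groups, test_size=0.2, stratify=y, random_state=0
)

# What stratification promises: matching class proportions.
train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"positive rate  train={train_rate:.3f}  test={test_rate:.3f}")

# What it does not promise: groups kept on one side of the split.
shared_groups = np.intersect1d(g_train, g_test)
print(f"groups present in both splits: {len(shared_groups)} of 50")
```

The class rates line up closely, yet rows from the same group land in both train and test. Whether that overlap matters depends on your data, but stratifying on y alone does nothing to prevent it.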
This matters because a naive random split on an imbalanced dataset can, by bad luck, give you a test set with almost no minority-class examples. Stratification fixes that specific problem reliably. But the moment you confuse