Fixing Pandas read_excel Silently Skipping Rows When Header Row Is Not First
You load an Excel file with pd.read_excel(), glance at your DataFrame, and something is off. Column names are wrong, rows are missing, or the first few rows are garbage data you never asked for. The function ran without raising a single warning.
This is one of the most common silent failures in pandas. The default behavior assumes your header lives on the very first row, and when it doesn't, the results look plausible enough that you might not catch the problem until your analysis is already wrong.
What you'll learn
- Why
read_excelsilently misbehaves when the header row isn't row zero - How the
header,skiprows, andnrowsparameters interact - How to inspect an Excel file before committing to parameters
- Patterns for handling multi-row headers and metadata blocks
- Common pitfalls and how to avoid them
Prerequisites
You need Python 3.8 or later, pandas installed, and either openpyxl (for .xlsx) or xlrd (for legacy .xls) as the engine. If you're not sure which you have, run pip show openpyxl xlrd in your terminal.
Why the Default Breaks on Real-World Excel Files
Pandas defaults to header=0, meaning it treats the very first row (zero-indexed) as your column names. Every row above that position is silently discarded, and every row below becomes data.
In practice, Excel files from business users, legacy exports, or report generators almost never put the header on row 1. You often get a logo block, a report title, a date stamp, or a blank row before the actual column names appear. Pandas doesn't know this, so it promotes whatever happens to be in row 0 to column names and treats real headers further down as data rows.
The failure is silent because no exception is raised. You get a valid DataFrame β just the wrong one.
Reproducing the Problem
Imagine an Excel sheet that looks like this:
| Row (1-indexed) | Content |
|---|---|
| 1 | Monthly Sales Report β Q2 |
| 2 | (blank) |
| 3 | Region, Product, Units, Revenue |
| 4 | North, Widget A, 120, 4800 |
| 5 | South, Widget B, 95, 3800 |
If you run the naive call, you get a mess:
import pandas as pd
df = pd.read_excel("sales_q2.xlsx")
print(df.head())
The output will use the report title cell as the first column name, and your real header row will appear as a data row. You won't see an error, just wrong data.
The header Parameter: Your First Fix
The header parameter tells pandas which row index (zero-based) contains your column names. To fix the example above, the header is on row 3 in Excel terms, which is index 2 in zero-based terms:
df = pd.read_excel("sales_q2.xlsx", header=2)
print(df.head())
This correctly assigns Region, Product, Units, and Revenue as column names. Everything above row index 2 is automatically discarded.
One important note: header is zero-indexed relative to the sheet, not relative to any rows you skip with skiprows. This distinction causes a lot of confusion, so keep it clearly in mind.
The skiprows Parameter: When You Need More Control
The skiprows parameter is for skipping specific rows before parsing begins. It accepts an integer, a list of integers, or a callable.
Skipping a fixed number of rows at the top
If you want to skip the first two rows entirely and then treat the next row as the header, you can combine skiprows with the default header=0:
# Skip the first 2 rows, then treat row 3 as the header
df = pd.read_excel("sales_q2.xlsx", skiprows=2, header=0)
This is functionally equivalent to header=2 in this case, but the two parameters interact differently when you need to skip non-contiguous rows.
Skipping specific row indices
If your file has scattered blank or metadata rows mixed into the data block, pass a list:
# Skip rows at index 0, 1, and 3 (for example, a blank row between header and data)
df = pd.read_excel("sales_q2.xlsx", skiprows=[0, 1, 3])
Use this carefully. Skipping non-contiguous rows with a list re-indexes the remaining rows before the header is resolved, which can cause unexpected behavior if you're also setting header to a specific value.
Using a callable for conditional skipping
For more complex cases, pass a lambda or function. It receives the row index (zero-based) and should return True to skip a row:
# Skip any row whose index is less than 2 (i.e., skip first two rows)
df = pd.read_excel("sales_q2.xlsx", skiprows=lambda x: x < 2)
This is cleaner than a list when you're skipping a range and makes the intent obvious to anyone reading the code later.
Inspecting the File Before You Commit to Parameters
Guessing the header row is fragile. A better approach is to read the first several rows raw, inspect them, and then decide on your parameters.
import pandas as pd
# Read without a header to see the raw structure
raw = pd.read_excel("sales_q2.xlsx", header=None, nrows=10)
print(raw)
Setting header=None tells pandas not to treat any row as column names β every row comes through as data, and columns are numbered 0, 1, 2, etc. Combined with nrows=10, this gives you a clean preview of the top of the file without loading the whole thing.
Once you can see the raw row indices, picking the right header value is obvious rather than trial-and-error.
Handling Multi-Row Headers
Some Excel exports use two or three rows to describe column names, with a category label spanning several columns on the first row and more specific names on the second. Pandas handles this with a list passed to header:
# Rows at index 2 and 3 together form a multi-level header
df = pd.read_excel("sales_q2.xlsx", header=[2, 3])
print(df.columns)
This produces a MultiIndex for the columns. Accessing data then looks like df["Region"]["North"] depending on how the levels are structured. If you don't need the hierarchy, you can flatten it after loading:
df.columns = [" ".join(col).strip() for col in df.columns.values]
That collapses the multi-level column names into a single string per column, which is usually easier to work with downstream.
Limiting Rows with nrows
The nrows parameter controls how many data rows are read after the header. It does not count the header row itself, and it does not count any rows skipped by skiprows. This is useful when you're testing parameters or when your Excel file has a footer block you want to exclude:
# Read only 100 data rows after the header
df = pd.read_excel("sales_q2.xlsx", header=2, nrows=100)
Combine header, skiprows, and nrows to carve out exactly the rectangle of data you need from any messy spreadsheet.
Common Pitfalls
Confusing 1-indexed Excel rows with 0-indexed pandas rows
Excel shows row numbers starting at 1. Pandas indexes rows starting at 0. If your header is on Excel row 4, pass header=3 to pandas. Writing header=4 is the single most common mistake here.
Using skiprows and header together without understanding the interaction
When you pass skiprows=2, pandas removes those rows from the data stream entirely before it looks at header. So header=0 with skiprows=2 points to what was originally the third row in the file, not the first. This is correct behavior, but it's easy to double-skip if you're not thinking clearly about the sequence.
Assuming the sheet name or index
By default, read_excel reads the first sheet. If your header is actually on the second sheet, add sheet_name="Sheet2" or sheet_name=1. Many debugging sessions end here.
Merged cells in the header row
Excel lets you merge cells, but pandas doesn't understand merged cells natively. A merged cell that spans columns B and C will populate column B with the value and leave column C as NaN in pandas. You'll need to forward-fill those column names manually if you're working with merged headers:
raw_header = pd.read_excel("sales_q2.xlsx", header=None, nrows=1)
filled_header = raw_header.ffill(axis=1)
Reading .xls files with the wrong engine
If you're working with older .xls files, openpyxl won't read them. You need xlrd and must pass engine="xlrd" explicitly. Failing to do this raises a clear error, but it's worth knowing upfront so you install the right dependency.
Putting It Together: A Robust Reader Function
Once you've inspected your file and know the structure, wrapping the call in a small utility function makes the intent clear and reusable:
import pandas as pd
def load_report(path: str, header_row: int, data_rows: int = None) -> pd.DataFrame:
"""
Load a report-style Excel file where the header is not on row 0.
Args:
path: Path to the .xlsx file.
header_row: Zero-based index of the row containing column names.
data_rows: Optional limit on how many data rows to read.
Returns:
A clean DataFrame with correct column names.
"""
return pd.read_excel(
path,
header=header_row,
nrows=data_rows,
engine="openpyxl",
)
df = load_report("sales_q2.xlsx", header_row=2)
print(df.head())
This is easier to audit than a bare read_excel call with multiple keyword arguments scattered through a notebook, and it gives your future self (or a teammate) a clear interface to adjust.
Wrapping Up
Silent row-skipping in read_excel is almost always a mismatch between where pandas expects the header and where it actually lives. The fix is straightforward once you know the right parameters.
Here are the concrete next steps to take:
- Run a raw preview with
header=None, nrows=10on any unfamiliar Excel file before assuming parameters. - Use
header=N(zero-indexed) to point pandas at the correct column name row directly. - Use
skiprowsfor non-header rows you want gone entirely, and be deliberate about how it interacts withheader. - Check for merged cells in header rows and forward-fill column names before relying on them.
- Wrap your
read_excelcall in a small function with explicit parameters so the structure is documented and easy to adjust.
π€ Share this article
Sign in to saveRelated Articles
Comments (0)
No comments yet. Be the first!