Fixing Python Pandas read_excel That Skips Rows When Header Is Not on Row One
You call pd.read_excel('report.xlsx'), glance at the DataFrame, and immediately notice something is off. The column names are strings like Unnamed: 0, or they are actual data values, or the first few real rows of data are simply gone. The file opened fine β Pandas just read the wrong part of it.
This happens because Pandas assumes the very first row of the sheet is the header. When the real header sits further down β because the file has a company logo, a report title, a date stamp, or blank separator rows above it β every row index is off from the start.
What you'll learn
- How the
headerparameter controls which row becomes your column names. - How to use
skiprowsto jump over preamble content before the header. - How to combine both parameters for messy, real-world layouts.
- How to handle files with no header, or with trailing summary rows you want to drop.
- Common mistakes that cause rows to disappear silently.
Prerequisites
- Python 3.8 or later.
- Pandas 1.3 or later (
pip install pandas openpyxl). - The
openpyxlengine installed β required for.xlsxfiles. - Enough familiarity with Pandas to read a DataFrame and inspect
.head().
Why Pandas defaults to row 0
Pandas maps Excel rows to zero-based Python indices internally. When you call pd.read_excel() without any extra arguments, header=0 is implied β meaning row index 0, which is the first visible row of the sheet, becomes the column header.
That assumption works perfectly for clean exports from databases or well-structured templates. It breaks the moment anyone adds a title block, a print area, or a few blank rows at the top of the sheet before the actual column labels. Pandas does not automatically detect what looks like a header; it just takes what you tell it to take.
Understanding this default is the key to fixing every variant of the problem.
Using the header parameter to point at the real header row
The header parameter accepts a zero-based row index. If your column labels are on the fifth row of the sheet (row 5 when you look at the Excel row numbers, row index 4 in Python), pass header=4.
import pandas as pd
# Excel sheet layout:
# Row 1: "Monthly Sales Report β Q3"
# Row 2: (blank)
# Row 3: "Generated: 2024-07-01"
# Row 4: (blank)
# Row 5: Region | Product | Units | Revenue <-- actual header
# Row 6+: data
df = pd.read_excel(
"sales_report.xlsx",
header=4, # 0-based: row 5 is index 4
engine="openpyxl",
)
print(df.head())
Everything above row index 4 is automatically discarded when you set header=4. Pandas treats that row as column names and starts pulling data from the row immediately after it.
To find the exact row index without guessing, open the file first with header=None and print a few rows:
raw = pd.read_excel("sales_report.xlsx", header=None, nrows=10)
print(raw)
Look at the output and count down from 0. That count is the number you pass to header.
Using skiprows to jump over preamble content
skiprows is subtly different from header. It physically removes rows from the input before Pandas even tries to find a header, which means after skipping, header=0 (the default) refers to whatever row is now first in the remaining data.
Use skiprows when you want to discard a set number of rows at the top and then treat the very next row as the header:
# Skip the first 4 rows entirely, then treat row 5 as the header
df = pd.read_excel(
"sales_report.xlsx",
skiprows=4,
engine="openpyxl",
)
print(df.columns.tolist())
You can also pass a list of specific row indices to skip when the rows you want to drop are not contiguous:
# Skip rows 0, 1, and 3 (indices), keep row 2 and row 4 onward
df = pd.read_excel(
"messy_report.xlsx",
skiprows=[0, 1, 3],
engine="openpyxl",
)
This is useful when a file has a title on row 0, a logo placeholder on row 1, a blank row 3, but a valid sub-header on row 2 you want to keep.
Combining header and skiprows for complex layouts
Some files mix both problems: there is junk at the top, and then the header is not the very first non-junk row. In that case, use both parameters together. skiprows runs first and removes rows from the raw input; then header picks the column row from whatever is left.
# Sheet layout:
# Row 0: logo/title
# Row 1: blank
# Row 2: subtitle block
# Row 3: blank
# Row 4: section label ("North America")
# Row 5: Region | Product | Units | Revenue <-- header
# Row 6+: data
# Skip the first 4 rows, then the new row 0 is "North America",
# new row 1 becomes the column header.
df = pd.read_excel(
"regional_report.xlsx",
skiprows=4,
header=1, # in the remaining rows, index 1 is the real header
engine="openpyxl",
)
Work through the math on paper before running it. After skiprows=4, the sheet's original row 4 becomes the new row 0. The original row 5 (your header) becomes the new row 1, so header=1 is correct.
Handling files with no header row at all
Sometimes the Excel file is a pure data dump with no column labels at all. If you use the default header=0, Pandas will silently steal your first data row and use it as column names.
Pass header=None to prevent this. Pandas will assign integer column names (0, 1, 2, β¦) and keep all rows as data:
df = pd.read_excel(
"raw_export.xlsx",
header=None,
engine="openpyxl",
)
# Assign your own column names afterward
df.columns = ["region", "product", "units", "revenue"]
This pattern is common when reading files generated by legacy systems that never include headers in their output.
Using skipfooter to drop trailing summary rows
Real-world Excel files often end with totals rows, signatures, or disclaimer text. These rows break aggregations if they get pulled in as data. The skipfooter parameter removes a specified number of rows from the end of the sheet before parsing.
# Sheet ends with:
# ... last data row ...
# Row N-2: "Total"
# Row N-1: "Prepared by Finance"
# Row N: blank
df = pd.read_excel(
"sales_report.xlsx",
header=4,
skipfooter=3, # drop the last 3 rows
engine="openpyxl",
)
Note: skipfooter only works with the python engine or openpyxl. If you are using the older xlrd engine, you will get a warning or error.
If you need to clean up export issues related to Excel file writing in general, the article on fixing Pandas to_excel that overwrites all sheets covers common write-side problems that produce similarly malformed files.
Reading only a specific range of rows with nrows
Sometimes you only need a slice of the sheet β for example, rows 6 through 106 out of a 10,000-row file. Use nrows to limit how many data rows (not counting the header) Pandas reads after it finds the header.
df = pd.read_excel(
"large_report.xlsx",
header=4,
nrows=100, # read exactly 100 data rows after the header
engine="openpyxl",
)
Combining skiprows, header, nrows, and skipfooter gives you precise control over exactly which rectangular slice of the spreadsheet becomes your DataFrame. If your columns still come out in the wrong order after doing this, check out the guide on fixing Pandas DataFrame that exports columns in the wrong order β that problem is distinct but often appears alongside header offset issues.
Common pitfalls and gotchas
Off-by-one errors
The single most common mistake is confusing Excel's 1-based row display with Python's 0-based index. If Excel shows your header on row 5, pass header=4 β not header=5. When in doubt, read with header=None, nrows=10 first and count from the printed output.
skiprows and header interact multiplicatively
After skiprows=N, the row indices reset. header=0 then refers to what was originally row N in the file. If you also pass header=M, Pandas counts M rows into the already-skipped data. Getting this wrong causes the header row to appear as a data row, or a data row to appear as the header. Print intermediate results to verify.
Merged cells in the header area
Excel files that use merged cells across the header row often produce Unnamed: X columns for every cell after the first in the merged group. Pandas cannot unmerge cells β it only sees the value in the first cell. You will need to rename those columns manually after loading, or pre-process the file with openpyxl. The article on fixing openpyxl merged cells that lose data goes deeper on this specific problem.
Multi-row headers
Some reports use two header rows β a category row on top and a sub-label row below it. Pass a list to header to create a MultiIndex:
# Rows 4 and 5 together form a two-level header
df = pd.read_excel(
"multi_header_report.xlsx",
header=[4, 5],
engine="openpyxl",
)
print(df.columns) # MultiIndex
MultiIndex column names require slightly different access syntax (df[('Category', 'Sub-label')]), so flatten them if your downstream code expects simple string columns.
The engine matters
xlrd only reads the older .xls format. For any .xlsx file, always specify engine="openpyxl". Forgetting this can produce cryptic parse errors that have nothing to do with your row-skipping logic.
Blank rows inside the data block
Sometimes the preamble includes blank rows, but so does the data itself (used as visual separators inside the sheet). skiprows with a list lets you target specific rows, but you may also want to call df.dropna(how='all') after loading to remove any blank data rows that slipped through.
If you run into unexpected NaN values after all of this, the guide on fixing Pandas pivot_table that returns NaN instead of zero covers how NaN propagates and how to handle it in aggregation contexts.
Wrapping up
The core rule is simple: tell Pandas exactly where the header lives and what to ignore. Here are the concrete steps to apply when read_excel gives you wrong columns or missing rows:
- Inspect the raw file first. Load with
header=None, nrows=15and print the result. Count the zero-based row index of your actual header row. - Set
header=Nto that index. This is the simplest fix for most cases. - Use
skiprowswhen the preamble rows vary per file, or when you want to remove non-contiguous rows by passing a list of indices. - Add
skipfooter=Nif the sheet ends with totals, signatures, or blank rows you do not want in your data. - Rename or flatten columns after loading if merged cells or multi-row headers left you with
Unnamedcolumn names or a MultiIndex you do not need.
Frequently Asked Questions
Why does pandas read_excel turn my first data row into column names?
Pandas defaults to header=0, meaning it always treats the first row of the sheet as column names. If your real data starts on row 1 but you have no explicit header, pass header=None to prevent Pandas from consuming that row.
How do I tell pandas read_excel to start reading from row 5 in Excel?
Row 5 in Excel is zero-based index 4 in Python. Pass header=4 to pd.read_excel() and Pandas will use that row as the column header and read data from row 6 onward, discarding everything above.
What is the difference between skiprows and header in pandas read_excel?
skiprows removes rows from the raw input before any parsing happens, then header picks the column row from what remains. You can use skiprows alone when the header immediately follows the skipped rows, or combine both when more complex layouts require it.
How can I drop the last few rows of an Excel sheet when reading it with pandas?
Use the skipfooter parameter: pd.read_excel('file.xlsx', skipfooter=3) removes the last 3 rows before building the DataFrame. This is useful for sheets that end with totals rows or disclaimer text.
Can pandas read_excel handle a spreadsheet with two header rows?
Yes. Pass a list to the header parameter, for example header=[0, 1], and Pandas creates a MultiIndex with one level per header row. You can then call df.columns = ['_'.join(col) for col in df.columns] to flatten it into simple string column names.
π€ Share this article
Sign in to saveRelated Articles
Comments (0)
No comments yet. Be the first!