Flattening Nested JSON into a Pandas DataFrame Without Losing Data
You hit an API, load the response, and call pd.DataFrame(data). Half your columns contain dictionaries. Another column is a list of objects. One field is just missing from every third record. Sound familiar?
Nested JSON is the rule, not the exception, when working with real-world APIs. The default Pandas constructor won't save you here: it silently packs nested structures into object columns and moves on. Getting that data into a flat, analyzable shape takes a few deliberate steps.
What you'll learn
- How to use json_normalize to flatten one or more levels of nesting
- How to handle arrays (lists of objects) nested inside records
- How to deal with missing or inconsistent keys across records
- How to write a recursive flattener for arbitrarily deep structures
- Common mistakes that silently drop data and how to avoid them
Prerequisites
You need Python 3.8 or later, Pandas 1.3 or later, and a basic familiarity with DataFrames. Install Pandas with pip install pandas if you haven't already. All examples use the standard library's json module; no extra dependencies.
Why pd.DataFrame() Isn't Enough
When Pandas encounters a dictionary value inside a list of records, it stores the whole dictionary as a single cell value. That cell has dtype object, and you cannot filter, group, or aggregate on it without extra work.
Consider this structure:
```python
import json  # typically used to parse the raw API response
import pandas as pd

data = [
    {
        "id": 1,
        "name": "Alice",
        "address": {"city": "Berlin", "zip": "10115"},
        "scores": [88, 92, 76]
    },
    {
        "id": 2,
        "name": "Bob",
        "address": {"city": "Hamburg", "zip": "20095"},
        "scores": [70, 85]
    }
]

df = pd.DataFrame(data)
print(df.dtypes)
```
The address column will be dtype object, storing raw dicts. The scores column stores raw lists. Neither is queryable in any meaningful way until you unpack them.
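To make the limitation concrete, here is a minimal sketch (using the same sample shape) of the manual workaround an object column forces on you:

```python
import pandas as pd

data = [
    {"id": 1, "name": "Alice", "address": {"city": "Berlin", "zip": "10115"}},
    {"id": 2, "name": "Bob", "address": {"city": "Hamburg", "zip": "20095"}},
]
df = pd.DataFrame(data)

# Each 'address' cell is a raw dict; you can't filter or group on the city
# directly. The manual escape hatch is .apply, which is slow and verbose:
cities = df["address"].apply(lambda d: d["city"])
```

This works for a one-off lookup, but every nested key needs its own .apply, which is exactly the busywork json_normalize eliminates.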
Using json_normalize for Single-Level Nesting
pd.json_normalize is the right starting point. It walks one level of nested keys and promotes them to top-level columns using a dot-separated naming convention.
```python
from pandas import json_normalize

df = json_normalize(data)
print(df.columns.tolist())
# ['id', 'name', 'scores', 'address.city', 'address.zip']
```
The address dictionary is now split into address.city and address.zip. The scores list is untouched; we'll get to that shortly.
You can change the separator if dots conflict with your column naming conventions:
```python
df = json_normalize(data, sep="_")
# columns: 'address_city', 'address_zip'
```
Flattening Nested Arrays with the record_path Argument
When a field contains a list of objects (not just scalars), you need the record_path argument. This tells json_normalize to treat that list as the primary row source and pull parent-level fields alongside it using meta.
Here's a more realistic API payload:
```python
orders = [
    {
        "order_id": "A001",
        "customer": "Alice",
        "items": [
            {"sku": "X1", "qty": 2, "price": 9.99},
            {"sku": "X2", "qty": 1, "price": 4.49}
        ]
    },
    {
        "order_id": "A002",
        "customer": "Bob",
        "items": [
            {"sku": "X3", "qty": 5, "price": 2.00}
        ]
    }
]

df = json_normalize(
    orders,
    record_path="items",
    meta=["order_id", "customer"]
)
print(df)
```
Each item becomes its own row, and the parent fields order_id and customer are repeated alongside every item. This is the correct relational shape for analysis.
Handling nested meta fields
If the fields you want in meta are themselves nested, pass them as a list of lists:
```python
df = json_normalize(
    orders,
    record_path="items",
    meta=[["shipping", "city"], "order_id"]
)
```
This pulls orders[n]["shipping"]["city"] for each record (assuming the payload actually carries a shipping object, which the sample above does not) and names the resulting column shipping.city by default.
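Since the orders sample has no shipping field, here is a hypothetical variant that does, showing the list-of-lists meta path end to end:

```python
import pandas as pd
from pandas import json_normalize

# Hypothetical payload where each order carries a nested 'shipping' dict.
orders_with_shipping = [
    {
        "order_id": "A001",
        "shipping": {"city": "Berlin"},
        "items": [{"sku": "X1", "qty": 2}],
    },
    {
        "order_id": "A002",
        "shipping": {"city": "Hamburg"},
        "items": [{"sku": "X3", "qty": 5}],
    },
]

df = json_normalize(
    orders_with_shipping,
    record_path="items",
    meta=[["shipping", "city"], "order_id"],
)
# Columns: 'sku', 'qty', 'shipping.city', 'order_id'
```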
Exploding Lists of Scalars
Not all list fields contain objects. Sometimes a field is just a list of values: tags, scores, category IDs. json_normalize leaves those alone. Use DataFrame.explode() to expand them into individual rows.
```python
df = json_normalize(data)  # 'scores' still holds lists
df_exploded = df.explode("scores").reset_index(drop=True)
print(df_exploded)
```
Each score now occupies its own row, with the parent record's other fields repeated. If you only want to analyze the list without duplicating the parent row, consider keeping the exploded frame separate and joining on a key.
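A sketch of that separate-and-join pattern, using the same sample shape: explode only a narrow two-column slice keyed by id, and merge back when needed.

```python
import pandas as pd
from pandas import json_normalize

data = [
    {"id": 1, "name": "Alice", "scores": [88, 92, 76]},
    {"id": 2, "name": "Bob", "scores": [70, 85]},
]

df = json_normalize(data)

# Keep a narrow scores table keyed by 'id' instead of duplicating every column.
scores = df[["id", "scores"]].explode("scores").reset_index(drop=True)

# Join back to the wide frame only when an analysis actually needs it.
merged = scores.merge(df.drop(columns="scores"), on="id")
```

For wide parent frames this keeps memory use down, since the repeated parent columns exist only in the merged result you actually compute on.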
Dealing with Missing and Inconsistent Keys
Real API responses are rarely uniform. One record has an address field; the next doesn't. One item has a discount key; most don't. By default, json_normalize fills missing keys with NaN, which is usually what you want. But there are edge cases.
errors='ignore' for unstable meta fields
If a meta field is absent from some records, json_normalize raises a KeyError by default. Pass errors='ignore' to substitute NaN instead:
```python
df = json_normalize(
    orders,
    record_path="items",
    meta=["order_id", "promo_code"],
    errors="ignore"
)
```
Pre-filling missing keys
If you need a more explicit default, normalize the records before passing them to Pandas:
```python
default_fields = {"promo_code": None, "shipping": {}}
normalized = [{**default_fields, **record} for record in orders]
df = json_normalize(normalized, record_path="items", meta=["order_id", "promo_code"])
```
Dictionary unpacking ensures every record has the expected keys before Pandas touches it.
Writing a Recursive Flattener for Arbitrary Depth
json_normalize descends into nested dictionaries (the depth is controllable with its max_level parameter), but it does not descend into lists of dictionaries beyond the single record_path you give it, and one call can't untangle structures that interleave dicts and lists several levels deep. For those cases, write a recursive flattener.
```python
def flatten_dict(d, parent_key="", sep="_"):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

flat_records = [flatten_dict(record) for record in data]
df = pd.DataFrame(flat_records)
```
This walks every dictionary key recursively and builds a flat key name by joining levels with the separator. Lists are left as-is; you handle those with explode() afterward.
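For instance, applied to a record nested three dictionary levels deep (a hypothetical geo field), the flattener produces one flat key per leaf:

```python
def flatten_dict(d, parent_key="", sep="_"):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            # Recurse, carrying the accumulated key prefix down a level.
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

# A record nested three levels deep.
record = {"id": 1, "address": {"geo": {"lat": 52.5, "lon": 13.4}}}
flat = flatten_dict(record)
# {'id': 1, 'address_geo_lat': 52.5, 'address_geo_lon': 13.4}
```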
Extending the flattener to handle lists of dicts
If you also want to inline small lists of dicts (for example, a field that always has exactly two elements), you can extend the function:
```python
def flatten_dict(d, parent_key="", sep="_"):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        elif isinstance(v, list) and v and isinstance(v[0], dict):
            # Inline each list element under an indexed key: items_0_sku, items_1_sku, ...
            for i, item in enumerate(v):
                items.extend(flatten_dict(item, f"{new_key}_{i}", sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)
```
Be cautious with this approach on variable-length lists: it produces a different set of columns per record if list lengths vary, which Pandas will fill with NaN. Use it only when the structure is stable.
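A small demonstration of that ragged-column behavior, using hypothetical records with mismatched list lengths:

```python
import pandas as pd

def flatten_dict(d, parent_key="", sep="_"):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        elif isinstance(v, list) and v and isinstance(v[0], dict):
            for i, item in enumerate(v):
                items.extend(flatten_dict(item, f"{new_key}_{i}", sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

# Records with different list lengths yield different column sets.
records = [
    {"id": 1, "items": [{"sku": "X1"}, {"sku": "X2"}]},
    {"id": 2, "items": [{"sku": "X3"}]},  # shorter list
]
df = pd.DataFrame([flatten_dict(r) for r in records])
# The second row has no items_1_sku key, so Pandas fills that cell with NaN.
```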
Common Pitfalls That Silently Drop Data
Calling pd.DataFrame() directly on nested data
This is the most common mistake. It looks like it worked because you get a DataFrame with the right number of rows, but nested columns are object dtype and contain raw Python structures, not values you can operate on.
Forgetting record_path when the key payload is inside an array
If the API returns {"results": [{...}, {...}]} and you pass the whole response to json_normalize without specifying record_path="results", you get a one-row DataFrame with a list in the results column. Always check the top-level structure of your JSON first.
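The difference is easy to see side by side on a small mock of that wrapped-response shape:

```python
from pandas import json_normalize

# A typical wrapped API response.
response = {"results": [{"id": 1, "value": 10}, {"id": 2, "value": 20}]}

# Without record_path: one row, with the whole list stuffed into one cell.
wrong = json_normalize(response)

# With record_path: one row per record inside 'results'.
right = json_normalize(response, record_path="results")
```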
Using explode on a column that's not consistently a list
If some rows have a scalar and others have a list in the same column, explode will leave the scalar rows untouched (which is usually correct in Pandas 1.3+), but you should verify the output. Use df[col].apply(type).value_counts() to audit a column's contents before exploding.
Ignoring the index after explode
After explode(), the index reflects the original row positions, so multiple new rows share the same index label. Always call .reset_index(drop=True) immediately after unless you have a reason to keep the original index.
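The duplicated labels are easy to verify on a two-row example:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "tags": [["a", "b"], ["c"]]})
exploded = df.explode("tags")

# Each new row keeps its parent row's index label: [0, 0, 1]
labels_before = exploded.index.tolist()

clean = exploded.reset_index(drop=True)
labels_after = clean.index.tolist()  # [0, 1, 2]
```

Duplicate labels are not just cosmetic: positional assumptions, .loc lookups, and later joins can all misbehave until the index is reset.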
Assuming json_normalize descends into lists automatically
json_normalize only expands dictionaries at the top level. Nested lists of objects require explicit record_path arguments. If you see an object-dtype column after normalization that you expected to be flat, it's almost always because a list-of-dicts was left untouched.
Putting It All Together: A Full Example
Here's a complete workflow for a realistic multi-level API payload: normalize the outer structure, explode the nested array, and rename columns for clarity.
```python
import pandas as pd
from pandas import json_normalize

response = [
    {
        "user_id": 101,
        "profile": {"name": "Alice", "tier": "gold"},
        "purchases": [
            {"item": "Widget A", "amount": 29.99, "tags": ["sale", "new"]},
            {"item": "Widget B", "amount": 14.99, "tags": ["new"]}
        ]
    },
    {
        "user_id": 102,
        "profile": {"name": "Bob", "tier": "silver"},
        "purchases": [
            {"item": "Gadget X", "amount": 9.99, "tags": []}
        ]
    }
]

# Step 1: flatten purchases array with parent meta
df = json_normalize(
    response,
    record_path="purchases",
    meta=["user_id", ["profile", "name"], ["profile", "tier"]],
    errors="ignore",
    sep="_"
)

# Step 2: explode the tags list into individual rows
df_tags = df.explode("tags").reset_index(drop=True)

# Step 3: rename columns for readability
df_tags.rename(columns={
    "profile_name": "user_name",
    "profile_tier": "user_tier"
}, inplace=True)

print(df_tags)
```
The final DataFrame has one row per purchase-tag combination, with user metadata repeated alongside each row (a purchase with an empty tags list, like Gadget X, keeps a single row with NaN in tags). Every field from the original payload is accounted for.
Next Steps
You now have a solid toolkit for turning messy nested JSON into a flat, query-ready DataFrame. Here are five concrete things to do next:
- Audit your API responses first. Print json.dumps(response[0], indent=2) on the first record before writing any normalization code. Knowing the exact shape saves debugging time.
- Write a utility function. Wrap your specific normalization + explode chain into a reusable function with the schema documented. Future-you will appreciate it when the API changes.
- Validate dtypes after loading. Call df.dtypes and check that no expected numeric column landed as object. If it did, you likely have a remaining nested structure somewhere.
- Use pd.json_normalize with large files carefully. For payloads with tens of thousands of records and deep nesting, load with ijson for streaming rather than reading the entire file into memory first.
- Explore PandaSQL or direct SQL alternatives if your JSON eventually lands in a database; querying structured data in SQL is often faster than DataFrame gymnastics once the data is stored.