Flattening Nested JSON into a Pandas DataFrame Without Losing Data

May 13, 2026

You hit an API, load the response, and call pd.DataFrame(data). Half your columns contain dictionaries. Another column is a list of objects. One field is just missing from every third record. Sound familiar?

Nested JSON is the rule, not the exception, when working with real-world APIs. The default Pandas constructor won't save you here — it silently packs nested structures into object columns and moves on. Getting that data into a flat, analyzable shape takes a few deliberate steps.

What you'll learn

  • How to use json_normalize to flatten one or more levels of nesting
  • How to handle arrays (lists of objects) nested inside records
  • How to deal with missing or inconsistent keys across records
  • How to write a recursive flattener for arbitrarily deep structures
  • Common mistakes that silently drop data and how to avoid them

Prerequisites

You need Python 3.8 or later, Pandas 1.3 or later, and a basic familiarity with DataFrames. Install Pandas with pip install pandas if you haven't already. All examples use the standard library's json module β€” no extra dependencies.

Why pd.DataFrame() Isn't Enough

When Pandas encounters a dictionary value inside a list of records, it stores the whole dictionary as a single cell value. That cell has dtype object, and you cannot filter, group, or aggregate on it without extra work.

Consider this structure:

import json

data = [
    {
        "id": 1,
        "name": "Alice",
        "address": {"city": "Berlin", "zip": "10115"},
        "scores": [88, 92, 76]
    },
    {
        "id": 2,
        "name": "Bob",
        "address": {"city": "Hamburg", "zip": "20095"},
        "scores": [70, 85]
    }
]

import pandas as pd
df = pd.DataFrame(data)
print(df.dtypes)

The address column will be dtype object, storing raw dicts. The scores column stores raw lists. Neither is queryable in any meaningful way until you unpack them.
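To see the problem concretely, here's a short sketch of what it takes to reach a nested value inside an object column. Every access goes through Python-level apply calls rather than vectorized column operations:

```python
import pandas as pd

data = [
    {"id": 1, "name": "Alice",
     "address": {"city": "Berlin", "zip": "10115"}, "scores": [88, 92, 76]},
    {"id": 2, "name": "Bob",
     "address": {"city": "Hamburg", "zip": "20095"}, "scores": [70, 85]},
]

df = pd.DataFrame(data)

# There is no 'address.city' column to filter on...
print("address.city" in df.columns)  # False

# ...so reaching the city means unpacking every cell by hand:
cities = df["address"].apply(lambda d: d["city"])
print(cities.tolist())  # ['Berlin', 'Hamburg']
```

This works, but it is slow on large frames and easy to get wrong, which is exactly why flattening up front pays off.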

Using json_normalize for Single-Level Nesting

pd.json_normalize is the right starting point. It walks nested dictionary keys and promotes them to top-level columns using a dot-separated naming convention. By default it descends through every level of nested dicts; our example only needs one.

from pandas import json_normalize

df = json_normalize(data)
print(df.columns.tolist())
# ['id', 'name', 'scores', 'address.city', 'address.zip']

The address dictionary is now split into address.city and address.zip. The scores list is untouched — we'll get to that shortly.

You can change the separator if dots conflict with your column naming conventions:

df = json_normalize(data, sep="_")
# columns: 'address_city', 'address_zip'
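Since json_normalize descends through every level of nested dicts by default, you can use the max_level parameter to stop at a fixed depth, for example to keep a deeply nested blob intact as a single column. A small sketch with a hypothetical two-level record:

```python
from pandas import json_normalize

deep = [{"id": 1, "meta": {"geo": {"lat": 52.5, "lon": 13.4}}}]

# Default: flatten all the way down.
print(json_normalize(deep).columns.tolist())
# ['id', 'meta.geo.lat', 'meta.geo.lon']

# max_level=1 stops after one level; the 'geo' dict stays a raw cell.
print(json_normalize(deep, max_level=1).columns.tolist())
# ['id', 'meta.geo']
```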

Flattening Nested Arrays with the record_path Argument

When a field contains a list of objects — not just scalars — you need the record_path argument. This tells json_normalize to treat that list as the primary row source and pull parent-level fields alongside it using meta.

Here's a more realistic API payload:

orders = [
    {
        "order_id": "A001",
        "customer": "Alice",
        "items": [
            {"sku": "X1", "qty": 2, "price": 9.99},
            {"sku": "X2", "qty": 1, "price": 4.49}
        ]
    },
    {
        "order_id": "A002",
        "customer": "Bob",
        "items": [
            {"sku": "X3", "qty": 5, "price": 2.00}
        ]
    }
]

df = json_normalize(
    orders,
    record_path="items",
    meta=["order_id", "customer"]
)
print(df)

Each item becomes its own row, and the parent fields order_id and customer are repeated alongside every item. This is the correct relational shape for analysis.
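A quick sanity check on the shape of the result above (repeating the data so the snippet runs standalone): the row count should equal the total number of items across all orders, and the meta columns land after the record-level columns.

```python
from pandas import json_normalize

orders = [
    {"order_id": "A001", "customer": "Alice",
     "items": [{"sku": "X1", "qty": 2, "price": 9.99},
               {"sku": "X2", "qty": 1, "price": 4.49}]},
    {"order_id": "A002", "customer": "Bob",
     "items": [{"sku": "X3", "qty": 5, "price": 2.00}]},
]

df = json_normalize(orders, record_path="items", meta=["order_id", "customer"])

# 2 + 1 items across the orders -> 3 rows, 5 columns
print(df.shape)              # (3, 5)
print(df.columns.tolist())   # ['sku', 'qty', 'price', 'order_id', 'customer']
```

Checks like these catch a mis-specified record_path immediately, before any analysis is built on the wrong shape.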

Handling nested meta fields

If the fields you want in meta are themselves nested, pass them as a list of lists. Suppose each record also carried a nested shipping object (not present in the orders sample above):

df = json_normalize(
    orders,
    record_path="items",
    meta=[["shipping", "city"], "order_id"],
    errors="ignore"  # 'shipping' is absent from the sample data
)

This pulls record["shipping"]["city"] for each record and names the resulting column shipping.city by default. Without errors="ignore", a missing shipping key would raise a KeyError.

Exploding Lists of Scalars

Not all list fields contain objects. Sometimes a field is just a list of values — tags, scores, category IDs. json_normalize leaves those alone. Use DataFrame.explode() to expand them into individual rows.

df = json_normalize(data)  # 'scores' still holds lists
df_exploded = df.explode("scores").reset_index(drop=True)
print(df_exploded)

Each score now occupies its own row, with the parent record's other fields repeated. If you only want to analyze the list without duplicating the parent row, consider keeping the exploded frame separate and joining on a key.
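One way to sketch that join-on-a-key approach, assuming id is a stable key in the sample data above: the list lives in its own narrow table, and the parent rows stay un-duplicated.

```python
import pandas as pd
from pandas import json_normalize

data = [
    {"id": 1, "name": "Alice",
     "address": {"city": "Berlin", "zip": "10115"}, "scores": [88, 92, 76]},
    {"id": 2, "name": "Bob",
     "address": {"city": "Hamburg", "zip": "20095"}, "scores": [70, 85]},
]

df = json_normalize(data)

# Narrow table: one row per (id, score) pair.
scores = (
    df[["id", "scores"]]
    .explode("scores")
    .rename(columns={"scores": "score"})
    .reset_index(drop=True)
)
scores["score"] = scores["score"].astype(int)  # explode leaves object dtype

# Aggregate on the narrow table; join back to df on 'id' when needed.
mean_scores = scores.groupby("id")["score"].mean()
print(mean_scores)
```

This keeps each table in proper relational shape: users in one frame, scores in another, linked by the key.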

Dealing with Missing and Inconsistent Keys

Real API responses are rarely uniform. One record has an address field; the next doesn't. One item has a discount key; most don't. By default, json_normalize fills missing keys with NaN, which is usually what you want. But there are edge cases.

errors='ignore' for unstable meta fields

If a meta field is absent from some records, json_normalize raises a KeyError by default. Pass errors='ignore' to substitute NaN instead:

df = json_normalize(
    orders,
    record_path="items",
    meta=["order_id", "promo_code"],
    errors="ignore"
)

Pre-filling missing keys

If you need a more explicit default, normalize the records before passing them to Pandas:

default_fields = {"promo_code": None, "shipping": {}}
normalized = [{**default_fields, **record} for record in orders]
df = json_normalize(normalized, record_path="items", meta=["order_id", "promo_code"])

Dictionary unpacking ensures every record has the expected keys before Pandas touches it. Keep in mind that this merge is shallow: it only fills keys missing at the top level, so a record with a partial nested dict (say, a shipping object without a city) still needs deeper handling.

Writing a Recursive Flattener for Arbitrary Depth

json_normalize descends into nested dictionaries, but it expands at most one list per call — the one you point record_path at. Lists of dictionaries anywhere else in the structure are left untouched, and deeply or irregularly nested payloads quickly outgrow what a single call can express. For those cases, write a recursive flattener.

def flatten_dict(d, parent_key="", sep="_"):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

flat_records = [flatten_dict(record) for record in data]
df = pd.DataFrame(flat_records)

This walks every dictionary key recursively and builds a flat key name by joining levels with the separator. Lists are left as-is β€” you handle those with explode() afterward.
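Repeating the function so the snippet runs standalone, here's how it behaves on a small three-level record:

```python
def flatten_dict(d, parent_key="", sep="_"):
    """Recursively flatten nested dicts into sep-joined keys; lists pass through."""
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

record = {"a": {"b": {"c": 1}}, "d": 2, "tags": ["x", "y"]}
print(flatten_dict(record))
# {'a_b_c': 1, 'd': 2, 'tags': ['x', 'y']}
```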

Extending the flattener to handle lists of dicts

If you also want to inline small lists of dicts (for example, a field that always has exactly two elements), you can extend the function:

def flatten_dict(d, parent_key="", sep="_"):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        elif isinstance(v, list) and v and isinstance(v[0], dict):
            for i, item in enumerate(v):
                items.extend(flatten_dict(item, f"{new_key}_{i}", sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

Be cautious with this approach on variable-length lists — it produces a different set of columns per record if list lengths vary, which Pandas will fill with NaN. Use it only when the structure is stable.
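To make that caveat concrete, here are two hypothetical records flattened from lists of different lengths. Pandas takes the union of their columns and fills the gaps with NaN:

```python
import pandas as pd

flat = [
    {"id": 1, "items_0_sku": "X1", "items_1_sku": "X2"},
    {"id": 2, "items_0_sku": "X3"},  # only one item -> no items_1_* keys
]

df = pd.DataFrame(flat)
print(df["items_1_sku"].isna().tolist())  # [False, True]
```

With long or highly variable lists, this column explosion gets out of hand fast; record_path plus explode is the better shape in those cases.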

Common Pitfalls That Silently Drop Data

Calling pd.DataFrame() directly on nested data

This is the most common mistake. It looks like it worked because you get a DataFrame with the right number of rows — but nested columns are object dtype and contain raw Python structures, not values you can operate on.

Forgetting record_path when the key payload is inside an array

If the API returns {"results": [{...}, {...}]} and you pass the whole response to json_normalize without specifying record_path="results", you get a one-row DataFrame with a list in the results column. Always check the top-level structure of your JSON first.

Using explode on a column that's not consistently a list

If some rows hold a scalar and others hold a list in the same column, explode leaves the scalar rows as-is, which is often what you want, but mixed types are a sign the upstream data needs cleaning. Use df[col].apply(type).value_counts() to audit a column's contents before exploding.
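If the audit reveals a mix, one defensive sketch (assuming None should become an empty list and a bare scalar a one-element list) is to coerce the column before exploding:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "tags": [["a", "b"], "c", None]})

# Audit: how many distinct Python types live in the column?
print(df["tags"].apply(type).value_counts())

# Coerce so every cell is a list before exploding.
df["tags"] = df["tags"].apply(
    lambda v: v if isinstance(v, list) else ([] if v is None else [v])
)
out = df.explode("tags").reset_index(drop=True)
```

Note that exploding an empty list still yields one row, with NaN in the exploded column, so the original record is not silently dropped.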

Ignoring the index after explode

After explode(), the index reflects the original row positions, so multiple new rows share the same index label. Always call .reset_index(drop=True) immediately after unless you have a reason to keep the original index.

Assuming json_normalize descends into lists automatically

json_normalize only expands dictionaries at the top level. Nested lists of objects require explicit record_path arguments. If you see an object-dtype column after normalization that you expected to be flat, it's almost always because a list-of-dicts was left untouched.
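A minimal sketch of the wrapped-payload pitfall and its fix, using a hypothetical {"results": ...} response:

```python
from pandas import json_normalize

response = {"results": [{"id": 1}, {"id": 2}], "page": 1}

# Wrong: one row, with the whole list packed into a single cell.
wrong = json_normalize(response)
print(wrong.shape)  # (1, 2)

# Right: one row per result, with the page number carried along.
right = json_normalize(response, record_path="results", meta=["page"])
print(right.shape)  # (2, 2)
```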

Putting It All Together: A Full Example

Here's a complete workflow for a realistic multi-level API payload — normalize the outer structure, explode the nested array, and rename columns for clarity.

import pandas as pd
from pandas import json_normalize

response = [
    {
        "user_id": 101,
        "profile": {"name": "Alice", "tier": "gold"},
        "purchases": [
            {"item": "Widget A", "amount": 29.99, "tags": ["sale", "new"]},
            {"item": "Widget B", "amount": 14.99, "tags": ["new"]}
        ]
    },
    {
        "user_id": 102,
        "profile": {"name": "Bob", "tier": "silver"},
        "purchases": [
            {"item": "Gadget X", "amount": 9.99, "tags": []}
        ]
    }
]

# Step 1: flatten purchases array with parent meta
df = json_normalize(
    response,
    record_path="purchases",
    meta=["user_id", ["profile", "name"], ["profile", "tier"]],
    errors="ignore",
    sep="_"
)

# Step 2: explode the tags list into individual rows
df_tags = df.explode("tags").reset_index(drop=True)

# Step 3: rename columns for readability
df_tags.rename(columns={
    "profile_name": "user_name",
    "profile_tier": "user_tier"
}, inplace=True)

print(df_tags)

The final DataFrame has one row per purchase-tag combination, with user metadata repeated alongside each row. Every field from the original payload is accounted for.

Next Steps

You now have a solid toolkit for turning messy nested JSON into a flat, query-ready DataFrame. Here are five concrete things to do next:

  • Audit your API responses first. Print json.dumps(response[0], indent=2) on the first record before writing any normalization code. Knowing the exact shape saves debugging time.
  • Write a utility function. Wrap your specific normalization + explode chain into a reusable function with the schema documented. Future-you will appreciate it when the API changes.
  • Validate dtypes after loading. Call df.dtypes and check that no expected numeric column landed as object. If it did, you likely have a remaining nested structure somewhere.
  • Use pd.json_normalize with large files carefully. For payloads with tens of thousands of records and deep nesting, load with ijson for streaming rather than reading the entire file into memory first.
  • Explore PandaSQL or direct SQL alternatives if your JSON eventually lands in a database — querying structured data in SQL is often faster than DataFrame gymnastics once the data is stored.
