SQL GROUP BY Duplicate and Inflated Counts: How to Fix Them
Your GROUP BY query returns a suspiciously large number β three times what you expected, or duplicate rows for the same key. The data looks right, the query looks right, and yet the totals are wrong. This is one of the most common silent bugs in SQL analytics, and it almost always has one of a handful of causes.
This article walks through each cause, shows you how to spot it, and gives you a concrete fix you can apply immediately.
What you'll learn
- Why JOINs are the most frequent source of inflated GROUP BY counts
- How to detect and eliminate duplicate rows before aggregation
- When to use
COUNT(DISTINCT β¦)versusCOUNT(*) - How a missing column in GROUP BY silently corrupts results
- Patterns for validating your aggregation output
Prerequisites
You should be comfortable writing basic SELECT, JOIN, and GROUP BY queries. The examples use standard SQL that runs on PostgreSQL and MySQL. Minor syntax differences may apply to other databases.
The JOIN Fanout Problem
The single biggest cause of inflated counts is a JOIN that multiplies rows before the aggregation runs. This is called a fanout.
Suppose you have an orders table and an order_items table with multiple rows per order. If you JOIN them and then GROUP BY customer_id, each order gets repeated once per item. Your COUNT of orders is now actually counting order-item pairs.
-- Broken: inflated because of the one-to-many JOIN
SELECT
o.customer_id,
COUNT(o.order_id) AS order_count
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.customer_id;If a customer has 3 orders and each order has 4 items, this returns 12 β not 3.
The fix depends on what you actually need. If you only need the order count, aggregate before joining:
-- Fixed: aggregate orders first, then join if needed
SELECT
o.customer_id,
COUNT(DISTINCT o.order_id) AS order_count
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.customer_id;Or better yet, avoid the join entirely if order_items isn't needed for the output:
SELECT customer_id, COUNT(order_id) AS order_count
FROM orders
GROUP BY customer_id;Duplicate Source Rows
Sometimes the source table itself has duplicates β either from a bad ETL load, a UNION without deduplication, or a view that doesn't enforce uniqueness. When you group over a table with duplicate rows, every aggregate you compute will be inflated.
Check for duplicates before trusting any aggregate:
-- Spot duplicate rows by the natural key
SELECT order_id, COUNT(*) AS row_count
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;If this returns any rows, your source data is dirty. You have two options: fix the upstream load, or deduplicate inline using a CTE or subquery.
-- Deduplicate with ROW_NUMBER before aggregating
WITH deduped AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY created_at DESC) AS rn
FROM orders
)
SELECT customer_id, COUNT(order_id) AS order_count
FROM deduped
WHERE rn = 1
GROUP BY customer_id;COUNT(*) vs COUNT(DISTINCT key)
COUNT(*) counts every row in the group, including duplicates introduced by joins or repeated source data. COUNT(DISTINCT column) counts unique values of that column in the group.
Neither is always correct β it depends on what you're measuring. Use COUNT(*) when you trust that each row represents exactly one thing. Use COUNT(DISTINCT id) when the same logical entity may appear more than once in your dataset.
-- How many unique products were ordered per customer?
SELECT
o.customer_id,
COUNT(DISTINCT oi.product_id) AS unique_products
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.customer_id;A common mistake is reaching for COUNT(DISTINCT β¦) as a band-aid without understanding why duplicates exist. Fix the data model or the join first; DISTINCT should be intentional, not a patch.
Missing Columns in GROUP BY
In strict SQL databases (PostgreSQL, for example), if a column appears in your SELECT but not in GROUP BY and is not inside an aggregate function, the query errors out. But MySQL in its default mode and some other engines will silently pick an arbitrary value for that column per group.
This causes subtle result corruption rather than obvious inflation. The row count may look right, but the values in non-aggregated columns are unpredictable.
-- Dangerous in MySQL: order_date is not in GROUP BY
SELECT customer_id, order_date, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id; -- MySQL may allow this; PostgreSQL will refuseThe fix is to always include every non-aggregated column in GROUP BY, or use an aggregate function like MAX(order_date) if you only need a representative value.
-- Correct: explicit aggregation of order_date
SELECT
customer_id,
MAX(order_date) AS latest_order_date,
COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;Multiple Aggregations on Different Grains
Sometimes you need two aggregates that come from different tables at different levels of granularity β for example, total revenue from order_items and total shipments from a shipments table, both grouped by customer.
Joining both tables before aggregating creates a cross-product of rows, inflating both figures at once. The right pattern is to pre-aggregate each table separately and then join the summaries.
-- Pre-aggregate each table independently
WITH revenue_by_customer AS (
SELECT o.customer_id, SUM(oi.amount) AS total_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.customer_id
),
shipments_by_customer AS (
SELECT customer_id, COUNT(shipment_id) AS shipment_count
FROM shipments
GROUP BY customer_id
)
SELECT
r.customer_id,
r.total_revenue,
s.shipment_count
FROM revenue_by_customer r
LEFT JOIN shipments_by_customer s ON s.customer_id = r.customer_id;This pattern keeps each aggregation isolated, so neither table inflates the other.
HAVING vs WHERE: Filtering After vs Before Grouping
A misplaced filter can change which rows participate in the aggregation, producing unexpected group counts. WHERE filters rows before grouping; HAVING filters groups after aggregation.
If you filter on an aggregate condition using WHERE, most databases will reject the query. But if you mistakenly filter a non-aggregate column in HAVING, you may get groups that contain fewer rows than you expect β leading to undercounts rather than overcounts, which can be just as misleading.
-- Wrong: WHERE cannot reference an aggregate
SELECT customer_id, COUNT(*) AS order_count
FROM orders
WHERE COUNT(*) > 5 -- error
GROUP BY customer_id;
-- Correct
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 5;Common Pitfalls at a Glance
| Symptom | Likely Cause | Fix |
|---|---|---|
| Count is a multiple of expected value | One-to-many JOIN fanout | Aggregate before joining or use COUNT(DISTINCT) |
| Same group key appears twice | Duplicate source rows | Deduplicate with ROW_NUMBER or fix the ETL |
| Non-key column has wrong value | Column missing from GROUP BY | Add to GROUP BY or wrap in aggregate |
| Two metrics both look inflated | Multi-table join before aggregation | Pre-aggregate each table in a CTE |
| Groups disappear after filter | WHERE applied instead of HAVING | Move condition to HAVING clause |
Validating Your Aggregation Results
Before shipping a report or dashboard, run a quick sanity check. Compare the aggregated total against a known baseline β for example, SUM of a revenue column in the raw table versus the same sum in the grouped output.
-- Baseline: total revenue from raw table
SELECT SUM(amount) FROM order_items;
-- Verify the grouped query matches
SELECT SUM(total_revenue)
FROM (
SELECT customer_id, SUM(amount) AS total_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY customer_id
) sub;If the two numbers don't match, you have a fanout or duplication issue in the grouped query. This two-query check takes thirty seconds and can save hours of downstream confusion.
You can also check the row count of your grouped output against the expected number of distinct keys:
-- These two numbers should match
SELECT COUNT(DISTINCT customer_id) FROM orders;
SELECT COUNT(*) FROM (
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
) sub;Wrapping Up
Most inflated GROUP BY counts trace back to one of a handful of root causes. Here are concrete actions to take right now:
- Audit your JOINs. Any time you join a table with a one-to-many relationship before aggregating, check whether the join is necessary at that stage. Move it after the GROUP BY, or pre-aggregate in a CTE.
- Check source tables for duplicates. Run the
HAVING COUNT(*) > 1check on your natural key before trusting any aggregate from that table. - Be intentional with COUNT(DISTINCT). Use it when you mean it, not as a default. If it patches an inflation problem without explaining why, find and fix the root cause.
- Validate totals against a raw baseline. A quick
SUMcomparison between the raw table and your grouped output catches fanout issues instantly. - Enable strict SQL mode in MySQL if you're not already using it. It prevents the silent arbitrary-value behavior for non-aggregated columns and forces you to write explicit, correct GROUP BY clauses.
π€ Share this article
Sign in to saveRelated Articles
Comments (0)
No comments yet. Be the first!