SQL DISTINCT vs GROUP BY Giving Different Row Counts Explained
You write a query with SELECT DISTINCT, check the row count, then rewrite it with GROUP BY expecting the same result β and the numbers don't match. Before you start doubting your data, know this: the difference is almost always intentional behavior, not a bug. Understanding why they diverge will save you a lot of time.
What you'll learn
- The fundamental difference between
DISTINCTandGROUP BYin SQL - Four concrete scenarios that cause different row counts between the two
- How NULL handling differs and silently changes your results
- How to verify which result is actually correct for your use case
- Performance tradeoffs so you can pick the right tool for the job
The problem: same query, different counts
Consider a sales table with columns customer_id, region, and amount. You run two queries that look equivalent:
-- Query A
SELECT DISTINCT customer_id, region
FROM sales;
-- Query B
SELECT customer_id, region
FROM sales
GROUP BY customer_id, region;
In this simple form, both queries return the same rows. Now add an aggregated column to Query B:
-- Query B (modified)
SELECT customer_id, region, SUM(amount) AS total
FROM sales
GROUP BY customer_id, region;
Suddenly you cannot run the equivalent with DISTINCT without also selecting amount β and as soon as you do that, the row counts can diverge. The divergence always has a root cause. Let's go through each one.
What DISTINCT actually does
SELECT DISTINCT eliminates duplicate rows from the result set after the column list in your SELECT is evaluated. It operates on the entire projected row β every column you listed. Two rows are considered duplicates only if all selected column values are identical.
SELECT DISTINCT customer_id, region, amount
FROM sales;
If the same customer bought in the same region with the same exact amount twice, those two rows collapse into one. If the amounts differ, both rows survive. DISTINCT does no aggregation β it simply deduplicates the projected output.
What GROUP BY actually does
GROUP BY partitions rows into groups, one group per unique combination of the grouped columns. Any column in your SELECT that is not in the GROUP BY clause must be wrapped in an aggregate function (SUM, COUNT, MAX, etc.).
SELECT customer_id, region, COUNT(*) AS purchase_count
FROM sales
GROUP BY customer_id, region;
Here, each unique (customer_id, region) pair produces exactly one output row, regardless of how many source rows it contains. GROUP BY is fundamentally about collapsing and summarizing. DISTINCT is about deduplicating a flat projection. That distinction is where the trouble starts.
Why the counts diverge: four concrete scenarios
Scenario 1: NULL values
NULL handling is the most common hidden cause of mismatched counts. DISTINCT treats NULLs as equal β two NULL values in the same column collapse into one row. GROUP BY does the same thing. So far, identical. But when you join GROUP BY with COUNT, the behavior can look very different.
-- Returns one row for NULL region
SELECT DISTINCT region FROM sales;
-- Also returns one row for NULL region
SELECT region FROM sales GROUP BY region;
The counts match here. But if you filter NULLs in a WHERE clause on one query and not the other, or if a join produces NULLs on one side, counts can silently diverge. If you're already investigating NULL-related GROUP BY quirks, the article on SQL GROUP BY silently excluding NULLs covers that in detail.
Scenario 2: Aggregated columns included in SELECT
This is the most direct cause of different row counts. The moment you add a non-grouped column to your SELECT DISTINCT, you change what "duplicate" means.
-- Each unique (customer_id, region, amount) is a distinct row
SELECT DISTINCT customer_id, region, amount
FROM sales;
-- Each unique (customer_id, region) is one row, amounts summed
SELECT customer_id, region, SUM(amount) AS total
FROM sales
GROUP BY customer_id, region;
Say customer 42 in the North region made three purchases: $10, $20, and $20. The DISTINCT query returns two rows (the $20 pair collapses). The GROUP BY query returns one row with a total of $50. These are answering fundamentally different questions β and both results are correct, just for different problems.
Scenario 3: DISTINCT ON vs GROUP BY in PostgreSQL
PostgreSQL offers DISTINCT ON (column), which is not standard SQL and behaves differently from both SELECT DISTINCT and GROUP BY.
-- PostgreSQL only: keep the first row per customer_id
SELECT DISTINCT ON (customer_id)
customer_id, region, amount
FROM sales
ORDER BY customer_id, amount DESC;
DISTINCT ON keeps exactly one row per unique value of the specified column β chosen by the ORDER BY. GROUP BY without aggregation is not allowed in standard SQL for the non-grouped columns. The row counts from DISTINCT ON (customer_id) will equal the number of unique customer_id values. A GROUP BY customer_id with all columns selected would require aggregating region and amount, which changes what you get back. These are genuinely different operations.
Scenario 4: Joins amplifying duplicates before deduplication
When you join tables before applying DISTINCT or GROUP BY, the join can multiply rows in ways that affect the two operations differently.
-- sales joined to order_items: each sale row may match multiple items
SELECT DISTINCT s.customer_id, s.region
FROM sales s
JOIN order_items oi ON s.sale_id = oi.sale_id;
SELECT s.customer_id, s.region
FROM sales s
JOIN order_items oi ON s.sale_id = oi.sale_id
GROUP BY s.customer_id, s.region;
Both of these return the same row count here β because the SELECT and GROUP BY columns are identical. But if you add oi.item_price to the SELECT DISTINCT, row counts explode because each item price variation survives deduplication. This is a common trap when you copy a GROUP BY query and swap it for DISTINCT without thinking about what columns are being exposed. For a deeper look at how joins amplify duplicates, check out this post on Power BI relationships causing duplicate rows in visuals β the join semantics apply equally in SQL.
How to verify which result is correct
When your counts don't match, run this diagnostic sequence. First, check what columns each query is actually comparing:
-- Are the column lists truly equivalent?
SELECT DISTINCT customer_id, region FROM sales; -- compares both columns
SELECT customer_id FROM sales GROUP BY customer_id; -- compares only one
Second, count NULLs in the relevant columns:
SELECT
COUNT(*) AS total_rows,
COUNT(customer_id) AS non_null_customer,
COUNT(region) AS non_null_region
FROM sales;
Third, use a HAVING clause to find groups with more than one source row, which tells you where GROUP BY is collapsing data that DISTINCT would not:
SELECT customer_id, region, COUNT(*) AS row_count
FROM sales
GROUP BY customer_id, region
HAVING COUNT(*) > 1;
Any row in this result is a place where GROUP BY reduced multiple source rows to one. SELECT DISTINCT customer_id, region would do the same β but if you add any varying column to the DISTINCT select list, those rows will reappear.
Finally, confirm your intent: are you trying to deduplicate a flat list, or summarize grouped data? If it's the former, use DISTINCT. If it's the latter, use GROUP BY. Mixing the two mental models is what causes the confusion in the first place.
Performance considerations
For simple deduplication of the same column set, most modern query optimizers (PostgreSQL, SQL Server, MySQL 8+) generate equivalent or near-equivalent execution plans for DISTINCT and GROUP BY. You can verify this with EXPLAIN or EXPLAIN ANALYZE.
EXPLAIN ANALYZE
SELECT DISTINCT customer_id, region FROM sales;
EXPLAIN ANALYZE
SELECT customer_id, region FROM sales GROUP BY customer_id, region;
In practice, DISTINCT is slightly cleaner to write when you only need deduplication. GROUP BY is the right choice the moment you need any aggregation. Using GROUP BY purely for deduplication (without aggregate functions) works, but a reader of your code will wonder if you forgot a COUNT or SUM. Clarity matters.
On large datasets, both operations need a sort or hash-based deduplication step. If the column you're deduplicating is indexed, the optimizer may use the index to skip the sort entirely. If your query is slow, look at whether adding an index on the GROUP BY or DISTINCT columns helps. The same principles that apply to window function performance apply here β see the breakdown on SQL window functions giving wrong totals when PARTITION BY is missing for context on how the optimizer handles partitioned vs. grouped operations.
Common pitfalls to watch for
- Adding a non-grouped column to SELECT DISTINCT changes the granularity. Every extra column you add is a new dimension of uniqueness. A column with high cardinality (like a timestamp or price) will drastically increase the row count.
- Using DISTINCT inside COUNT(
DISTINCT column) is not the same as SELECT DISTINCT.COUNT(DISTINCT customer_id)counts unique non-NULL values of that column. It does not deduplicate the result set. - MySQL's historical lenient GROUP BY. In older MySQL configurations (
ONLY_FULL_GROUP_BYdisabled), you could writeGROUP BY customer_idwhile selectingregionwithout an aggregate. The database would silently pick an arbitrary value forregion. This is a major source of wrong results. Always enableONLY_FULL_GROUP_BYor use standard-compliant SQL. - Subqueries and CTEs hiding row multiplication. If your
FROMclause references a subquery or CTE that joins tables, the row count entering yourDISTINCTorGROUP BYmay already be inflated. Debug by running the inner query first and checking its count. - DISTINCT on a SELECT * after a join. This is almost always a mistake. You're deduplicating every column from every joined table, which rarely does what you intend.
These patterns appear in many forms across different tools. If you're working in pandas and hitting similar deduplication issues after reshaping, the guide on pandas melt and stack producing duplicate rows covers the equivalent traps in Python.
Wrapping up
DISTINCT and GROUP BY are not interchangeable, even when they occasionally return the same result. Here are the concrete actions to take from here:
- Audit your SELECT list. Every column you add to a
SELECT DISTINCTchanges what counts as a duplicate. Make sure the columns match the question you're asking. - Run EXPLAIN on both queries when you see a count mismatch. The execution plan often reveals where a join is multiplying rows before deduplication kicks in.
- Check for NULLs in your key columns using
COUNT(*)vsCOUNT(column)and cross-reference with your GROUP BY output. - Use GROUP BY when you need aggregation, and DISTINCT when you only need a deduplicated list with no summary columns.
- Enable ONLY_FULL_GROUP_BY in MySQL environments to prevent the database from silently picking arbitrary column values in non-grouped columns.
Frequently Asked Questions
Why does SELECT DISTINCT return more rows than GROUP BY on the same table?
SELECT DISTINCT returns more rows when your SELECT list includes a column with varying values that GROUP BY would collapse via aggregation. Every unique combination of all selected columns survives DISTINCT, so a high-cardinality column like a price or timestamp will produce far more rows than a GROUP BY on just the key columns.
Is SELECT DISTINCT slower than GROUP BY in SQL?
For identical column sets with no aggregation, most modern databases generate the same or very similar execution plans for both. The practical performance difference is negligible. GROUP BY becomes the better choice the moment you need aggregate functions, while DISTINCT is cleaner for pure deduplication.
Can DISTINCT and GROUP BY ever return different results even with the same columns?
Yes, in MySQL with ONLY_FULL_GROUP_BY disabled, GROUP BY can silently return an arbitrary value for non-grouped columns, while DISTINCT always uses the actual row values. Also, DISTINCT ON in PostgreSQL is a different operator entirely and will produce one row per specified key, which is not equivalent to SELECT DISTINCT on all columns.
How do NULL values affect DISTINCT vs GROUP BY row counts?
Both DISTINCT and GROUP BY treat multiple NULL values as a single group, so a column with many NULLs will collapse them into one output row under either approach. The divergence appears when a join produces NULLs on one side or when a WHERE clause filters NULLs differently across two versions of your query.
When should I use COUNT(DISTINCT column) instead of SELECT DISTINCT?
Use COUNT(DISTINCT column) when you want a single number representing how many unique non-NULL values exist in a column, without retrieving the individual values. Use SELECT DISTINCT when you need the actual deduplicated rows returned as a result set for further use or inspection.
π€ Share this article
Sign in to saveRelated Articles
Data & Analytics
Power BI Time Intelligence Giving Wrong Results with Non-Standard Calendars
9m read
Data & Analytics
Power BI Cumulative Totals Resetting Unexpectedly: DAX Fixes That Actually Work
8m read
Data & Analytics
Pandas melt and stack Producing Duplicate Rows: Reshaping Pitfalls Fixed
4m read
Comments (0)
No comments yet. Be the first!