SQL DISTINCT vs GROUP BY: When Each One Silently Lies to You

June 02, 2026 7 min read 36 views
Abstract flat illustration showing two overlapping data table shapes on a cool gradient background, representing SQL query deduplication concepts

You run a quick SELECT DISTINCT customer_id to count unique customers, paste the number into your report, and your manager flags it as wrong the next day. No error was thrown. The query looked right. But the result silently lied to you. This happens more often than most SQL guides admit, and it happens with GROUP BY just as easily.

Understanding why these two constructs behave differently β€” and when each one quietly misleads you β€” is one of the most practical things you can do for your data work.

What you'll learn

  • The actual difference between DISTINCT and GROUP BY at the query-engine level
  • Situations where DISTINCT gives you a row count that is technically correct but analytically wrong
  • Situations where GROUP BY silently swallows data you needed to see
  • How NULLs interact with both, and why that matters
  • A decision framework for picking the right tool every time

How They Actually Work

DISTINCT operates on the full output row. After the query engine evaluates everything else β€” joins, filters, computed columns β€” it scans the result set and removes any row that is a byte-for-byte duplicate of a row already seen. It does not aggregate. It does not compute. It just deduplicates.

GROUP BY, on the other hand, partitions rows into buckets defined by the columns you name, and then collapses each bucket into a single output row. Aggregation functions (COUNT, SUM, MAX, etc.) operate on each bucket separately. Even if you use no aggregate functions at all, the engine still performs the grouping step.

For simple single-table, single-column queries, the two are functionally equivalent and most databases produce identical query plans. The moment you add joins, multiple columns, or NULLs, the differences start to matter.

The Join Trap: Where DISTINCT Lies First

Imagine you have an orders table and a order_items table. You want the list of unique customer IDs who placed at least one order.

-- Looks innocent
SELECT DISTINCT o.customer_id
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id;

This query joins orders to their line items first. A customer with one order containing five items now appears five times before DISTINCT runs. DISTINCT then collapses those five rows back to one β€” and you get the right answer. Fine so far.

Now someone asks: "How many line items did each customer order?" A junior analyst does this:

-- Silently wrong
SELECT DISTINCT o.customer_id, oi.product_id
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id;

The intent was "unique customers and their products." What DISTINCT actually does is deduplicate on the combination of customer_id and product_id. If a customer ordered the same product in two different orders, that row appears twice in the result. No error. Just quietly wrong cardinality.

The correct approach when you need counts is GROUP BY with an explicit aggregate:

-- Intentional and correct
SELECT o.customer_id, COUNT(oi.item_id) AS total_items
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.customer_id;

The GROUP BY Trap: When It Swallows Rows

GROUP BY has its own silent failure mode. It happens when you group by a high-cardinality column that is not actually what you want to group by, or when you accidentally group by something that creates more buckets than you expected.

-- You think this counts orders per day
SELECT order_date, COUNT(*) AS order_count
FROM orders
GROUP BY order_date;

If order_date is stored as a TIMESTAMP instead of a DATE, every unique timestamp creates its own group. An order placed at 09:00:01 and one placed at 09:00:02 on the same day end up in separate buckets. Your "daily" report now has one row per second. No error message. The counts just look oddly low.

-- Correct: cast to date first
SELECT DATE(order_date) AS order_day, COUNT(*) AS order_count
FROM orders
GROUP BY DATE(order_date);

Another GROUP BY trap is including a column in your SELECT that is not in your GROUP BY and not wrapped in an aggregate. MySQL (with certain SQL modes disabled) will silently pick an arbitrary value from the group for that column. PostgreSQL and SQL Server throw an error, which is safer β€” but if you are working across database engines, you cannot rely on that protection.

NULLs: The Quiet Third Party

Both DISTINCT and GROUP BY treat NULL as a grouping value, which surprises many people. In standard SQL, NULL != NULL is true β€” two NULLs are not equal. But for deduplication and grouping purposes, the engines treat all NULLs in the same column as belonging to the same bucket.

-- Suppose email has three NULLs in it
SELECT DISTINCT email FROM users;

You get one row with NULL in the result, not three. That is usually what you want, but it means your distinct count of emails is not the same as your count of users with a known email address. If you are computing coverage metrics, this distinction matters.

-- Count non-NULL emails only
SELECT COUNT(email) AS known_emails,
       COUNT(*) AS total_users
FROM users;

COUNT(column) skips NULLs. COUNT(*) does not. Mixing the two up inside a GROUP BY query is another common source of silent incorrect results.

Performance: When the Difference Is Not Just Semantic

For large tables, the choice between DISTINCT and GROUP BY can have a real impact on query time, even when both return the same result.

Modern query optimizers in PostgreSQL, MySQL 8+, and SQL Server are often smart enough to compile a single-column DISTINCT into the same plan as a single-column GROUP BY. When you add joins, subqueries, or multiple columns, the optimizer has more choices to make and can take different paths for each construct.

A practical rule: if your query already needs GROUP BY to compute an aggregate, do not bolt on DISTINCT afterward to "clean up" the result. If you need DISTINCT after grouping, something in your query is producing duplicates you have not accounted for β€” fix the root cause instead.

-- Anti-pattern: DISTINCT on top of GROUP BY
SELECT DISTINCT customer_id, SUM(total) AS revenue
FROM orders
GROUP BY customer_id;

-- GROUP BY customer_id already guarantees one row per customer_id.
-- The DISTINCT is doing nothing and signals a misunderstanding.

Subquery and Window Function Interactions

Both constructs behave in non-obvious ways inside subqueries and alongside window functions.

DISTINCT inside a subquery

SELECT *
FROM (
  SELECT DISTINCT customer_id, order_date
  FROM orders
) AS deduped
WHERE order_date >= '2024-01-01';

The DISTINCT here deduplicates on both customer_id and order_date together. If the same customer placed two orders on the same date, only one row survives. That might be exactly what you want β€” or it might be hiding a duplicate order that should be investigated. Know your intent before writing the query.

GROUP BY with window functions

Window functions run after GROUP BY collapses rows. This means you cannot use a window function to "look back" at the original un-grouped rows once grouping has happened.

-- This will error: window function sees grouped rows, not raw rows
SELECT customer_id,
       SUM(total) AS revenue,
       ROW_NUMBER() OVER (ORDER BY order_date) AS rn  -- order_date is gone after GROUP BY
FROM orders
GROUP BY customer_id;

If you need both aggregation and row-level window functions, compute them in separate CTEs and join the results.

Common Pitfalls Checklist

  • DISTINCT on multiple columns β€” deduplicates the combination, not each column independently. A common misconception among analysts new to SQL.
  • GROUP BY on a TIMESTAMP column β€” creates one group per exact timestamp, not per day. Always cast to the desired granularity first.
  • COUNT(*) vs COUNT(column) β€” the difference is NULLs. Choose deliberately.
  • DISTINCT inside COUNT β€” COUNT(DISTINCT customer_id) is valid and useful, but make sure you understand it counts unique non-NULL values of that column, not unique rows in the result set.
  • MySQL non-strict mode β€” allows non-aggregated columns in SELECT when using GROUP BY, silently picking arbitrary values. Enable strict mode in production.
  • DISTINCT after GROUP BY β€” almost always a sign the underlying query has a logic error. Fix the query, not the symptom.

A Decision Framework

When you are about to write either construct, ask yourself these questions in order:

  1. Do I need an aggregate value (count, sum, average)? If yes, use GROUP BY. Full stop.
  2. Do I need only unique rows and nothing computed per group? If yes, DISTINCT is appropriate β€” but verify which columns you are deduplicating on.
  3. Am I working with a join? Understand whether the join multiplies rows before DISTINCT runs, and whether that multiplication is intentional.
  4. Are there NULLs in the columns I am deduplicating or grouping on? Decide upfront whether NULLs should be their own group or excluded entirely.
  5. Am I seeing unexpected row counts in my result? Add a plain SELECT * before the DISTINCT or GROUP BY to inspect the raw data the engine is working with.

Wrapping Up

Neither DISTINCT nor GROUP BY is wrong by itself. They are wrong when applied without understanding what the engine actually does with them. The silent part is what makes these bugs expensive β€” a wrong number in a dashboard or a report can travel a long way before anyone catches it.

Here are four concrete actions to take right now:

  • Audit one query you wrote recently that uses DISTINCT with more than one column. Verify that you meant to deduplicate on the combination, not the individual columns.
  • Check your date columns in any GROUP BY queries. Confirm they are DATE type, not TIMESTAMP, or that you are casting explicitly.
  • Replace COUNT(*) with COUNT(column) in any query where NULLs in that column would change the meaning of the metric.
  • Enable strict SQL mode in any MySQL database you control, so non-aggregated columns in SELECT produce an error instead of a silent wrong value.
  • Inspect the raw join output of any query where you are using DISTINCT to clean up after a join. If the join is multiplying rows unexpectedly, fix the join condition rather than masking the problem with DISTINCT.

πŸ“€ Share this article

Sign in to save

Comments (0)

No comments yet. Be the first!

Leave a Comment

Sign in to comment with your profile.

πŸ“¬ Weekly Newsletter

Stay ahead of the curve

Get the best programming tutorials, data analytics tips, and tool reviews delivered to your inbox every week.

No spam. Unsubscribe anytime.