Deduplicating Postgres Rows with ROW_NUMBER When No Primary Key Exists

May 18, 2026 6 min read 56 views
Flat illustration of duplicate database rows stacked on a blue gradient, with one amber row highlighted to show deduplication in progress

You've inherited a Postgres table with no primary key, and it's full of duplicate rows. Maybe a broken ETL job ran twice, or someone did a bulk insert without a uniqueness check. Whatever the cause, you need to remove the duplicates without deleting the originals β€” and without a primary key to hang your hat on.

The good news is that PostgreSQL's ROW_NUMBER() window function gives you a reliable, precise way to label each row and delete only the extras. No extensions required, no table rebuilds.

What you'll learn

  • How ROW_NUMBER() assigns a rank to duplicate rows
  • How to use a CTE to isolate and delete duplicates safely
  • How to handle the case where rows are completely identical (no distinguishing column at all)
  • Common mistakes that silently delete the wrong rows
  • How to add a primary key afterward so this never happens again

Prerequisites

You need Postgres 9.1 or later (window functions have been available since 8.4, and CTEs since 8.4 too β€” you're almost certainly fine). You should be comfortable writing basic SELECT and DELETE statements. The examples below use psql, but any Postgres client works.

Understanding the Problem

Without a primary key, Postgres has no built-in way to distinguish between two otherwise identical rows. Internally, every row does have a system column called ctid that holds its physical location on disk β€” and that's the key to the whole approach.

The strategy is: use ROW_NUMBER() partitioned by the columns that define a duplicate, then delete every row whose number is greater than 1. The first occurrence survives; all copies go.

Setting Up a Test Table

Start with a reproducible example so you can verify your query before running it against production data.

CREATE TABLE orders (
    customer_id  INT,
    product_code TEXT,
    order_date   DATE,
    amount       NUMERIC(10, 2)
);

INSERT INTO orders VALUES
    (1, 'WIDGET-A', '2024-01-15', 49.99),
    (1, 'WIDGET-A', '2024-01-15', 49.99),  -- duplicate
    (2, 'GADGET-B', '2024-01-16', 129.00),
    (3, 'WIDGET-A', '2024-01-15', 49.99),  -- different customer, not a dup
    (2, 'GADGET-B', '2024-01-16', 129.00); -- duplicate

This table has no primary key. Rows 1 and 2 are duplicates of each other, and rows 3 and 5 are duplicates of each other (same customer, product, date, and amount). Row 4 shares the same product and date but a different customer β€” it is not a duplicate.

Using ROW_NUMBER() to Spot Duplicates

Before deleting anything, run a SELECT to confirm your window function is labelling rows the way you expect.

SELECT
    customer_id,
    product_code,
    order_date,
    amount,
    ROW_NUMBER() OVER (
        PARTITION BY customer_id, product_code, order_date, amount
        ORDER BY (SELECT NULL)  -- no meaningful order, so arbitrary
    ) AS rn
FROM orders;

The PARTITION BY clause defines what makes two rows duplicates. Every combination of those columns starts a fresh numbering sequence. The first row in each partition gets rn = 1, the second gets rn = 2, and so on. You want to delete every row where rn > 1.

The ORDER BY (SELECT NULL) is a common trick when you genuinely don't care which copy survives. If you want a deterministic choice β€” for example, keep the row with the earliest order_date β€” replace it with ORDER BY order_date ASC.

Deleting the Duplicates with a CTE

Postgres lets you use a data-modifying CTE to delete rows identified by a subquery. This is the cleanest pattern for deduplication.

WITH duplicates AS (
    SELECT
        ctid,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id, product_code, order_date, amount
            ORDER BY ctid  -- ctid gives a stable, deterministic tiebreak
        ) AS rn
    FROM orders
)
DELETE FROM orders
WHERE ctid IN (
    SELECT ctid FROM duplicates WHERE rn > 1
);

A few things to note about this query. The CTE selects each row's ctid alongside its row number. The outer DELETE uses that ctid list to remove only the extra copies. Using ctid in the ORDER BY inside the window function gives you a stable tiebreak β€” the row with the lower physical address is kept.

Always run this inside a transaction and verify the row count before committing. Wrap the whole thing in BEGIN; ... ROLLBACK; first, check the count, then re-run with COMMIT;.

BEGIN;

WITH duplicates AS (
    SELECT
        ctid,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id, product_code, order_date, amount
            ORDER BY ctid
        ) AS rn
    FROM orders
)
DELETE FROM orders
WHERE ctid IN (
    SELECT ctid FROM duplicates WHERE rn > 1
);

-- Check: should show only unique rows
SELECT * FROM orders;

-- If it looks right:
COMMIT;

-- If something looks wrong:
-- ROLLBACK;

Handling Completely Identical Rows

The approach above works even when every column is identical, because ctid is always unique per row β€” it reflects the row's physical location, not its data. That's why you pass ctid to the ORDER BY instead of a business column.

If you're on a very old Postgres version where ctid feels fragile (it can theoretically change after a VACUUM FULL or table rewrite), another safe approach is to copy the keepers into a staging table, truncate the original, then insert the keepers back.

-- Step 1: create a staging table with only unique rows
CREATE TABLE orders_clean AS
SELECT DISTINCT ON (customer_id, product_code, order_date, amount)
    customer_id, product_code, order_date, amount
FROM orders
ORDER BY customer_id, product_code, order_date, amount, ctid;

-- Step 2: swap the tables
BEGIN;
TRUNCATE orders;
INSERT INTO orders SELECT * FROM orders_clean;
COMMIT;

-- Step 3: clean up
DROP TABLE orders_clean;

The DISTINCT ON shortcut is actually more readable for simple cases, but it doesn't give you the row-level control that ROW_NUMBER() does when you need to keep a specific copy based on complex ordering logic.

Common Pitfalls

Forgetting NULL columns in PARTITION BY

If any column in your PARTITION BY contains NULL, two rows with NULL in that column will not be treated as duplicates of each other by default, because NULL != NULL in SQL. Use COALESCE to substitute a sentinel value if you want nulls to match.

PARTITION BY customer_id, COALESCE(product_code, ''), order_date, amount

Partitioning on too few columns

If you only partition by customer_id and accidentally omit order_date, you'll mark a legitimate second purchase from the same customer as a duplicate. Double-check that your PARTITION BY includes every column that distinguishes a genuine record from a copy.

Running the delete without a transaction

A plain DELETE outside a transaction auto-commits immediately. There's no undo. Always wrap deduplication deletes in BEGIN ... COMMIT and verify the result set before committing.

Trusting ctid across a VACUUM FULL

ctid is a physical pointer, and Postgres can reassign it during a VACUUM FULL or CLUSTER. As long as your CTE and DELETE happen in the same transaction (which they do in the pattern above), this is not a problem β€” no vacuum runs mid-transaction. Just don't store ctid values in a separate table and reuse them later.

Verifying the Result

After committing, confirm there are no more duplicates with a quick aggregation query.

SELECT
    customer_id,
    product_code,
    order_date,
    amount,
    COUNT(*) AS occurrences
FROM orders
GROUP BY customer_id, product_code, order_date, amount
HAVING COUNT(*) > 1;

If this returns zero rows, you're clean. If it returns rows, re-examine your PARTITION BY columns β€” you may have missed a column that varies between the duplicates.

Adding a Primary Key to Prevent Recurrence

Once the table is clean, add a surrogate primary key so this situation can't creep back in.

ALTER TABLE orders ADD COLUMN id SERIAL PRIMARY KEY;

If you also want to enforce uniqueness at the data level, add a unique constraint on the business columns:

ALTER TABLE orders
    ADD CONSTRAINT uq_orders_business_key
    UNIQUE (customer_id, product_code, order_date, amount);

This will reject any future insert that would create a duplicate, failing fast at the database level rather than silently accumulating bad data.

Next Steps

  • Run the SELECT version of the ROW_NUMBER() query against your actual table first and read through the output before touching DELETE.
  • Wrap every deduplication run in BEGIN ... COMMIT and checkpoint with a row count check.
  • Add a surrogate primary key (SERIAL or BIGSERIAL) to any table that currently lacks one.
  • Add a UNIQUE constraint on the natural key columns to block future duplicates at insert time.
  • If you're deduplicating a very large table (millions of rows), consider batching the deletes or doing the staging-table swap approach to avoid long-running locks.

πŸ“€ Share this article

Sign in to save

Comments (0)

No comments yet. Be the first!

Leave a Comment

Sign in to comment with your profile.

πŸ“¬ Weekly Newsletter

Stay ahead of the curve

Get the best programming tutorials, data analytics tips, and tool reviews delivered to your inbox every week.

No spam. Unsubscribe anytime.