Wednesday, 16 July 2025

Cleaning Up Duplicate Records in SQL Server: A Practical Walkthrough

🧹 How to Identify and Remove Duplicate Records in SQL Server

Duplicate records are a common headache in databases. They mess up reporting, inflate numbers, and introduce inconsistencies that can quietly break things. Whether you're cleaning legacy data or preventing future problems, knowing how to find and eliminate duplicates is an essential SQL skill.

Here’s a practical approach using a sample table we'll call xxyyzz.

Step 1: Explore the Table

Start by reviewing what’s in the table:

SELECT * FROM xxyyzz;

This gives you the lay of the land. Before doing anything destructive, it's crucial to understand which columns matter when determining if two rows are considered duplicates.
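
A quick way to gauge how big the problem is before you touch anything: compare the total row count with the count of distinct combinations over the columns you care about. The column names below (FirstName, LastName, Email) are hypothetical, used only for illustration; substitute whatever defines uniqueness in your table.

-- Total rows in the table
SELECT COUNT(*) AS TotalRows
FROM xxyyzz;

-- Distinct combinations of the columns that should be unique
-- (FirstName, LastName, Email are assumed example columns)
SELECT COUNT(*) AS DistinctRows
FROM (
    SELECT DISTINCT FirstName, LastName, Email
    FROM xxyyzz
) AS d;

If TotalRows is larger than DistinctRows, the difference tells you roughly how many rows the cleanup will remove.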

Step 2: Detect Duplicates

Let’s say duplicates are defined by a combination of columns — for example, everything except a unique ID or timestamp. You can group by those columns and count how many times each combination appears.

SELECT
    column1, column2, column3, ..., columnN,
    COUNT(*) AS DuplicateCount
FROM xxyyzz
GROUP BY
    column1, column2, column3, ..., columnN
HAVING COUNT(*) > 1;

Replace column1 through columnN with the fields that define a unique row in your context.
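
For instance, if a customer table treats FirstName, LastName, and Email as the identity of a row (hypothetical columns, purely for illustration), the detection query would look like this:

-- Example with assumed columns; adjust to your schema.
SELECT
    FirstName, LastName, Email,
    COUNT(*) AS DuplicateCount
FROM xxyyzz
GROUP BY
    FirstName, LastName, Email
HAVING COUNT(*) > 1
ORDER BY DuplicateCount DESC;  -- worst offenders first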

Step 3: Remove Duplicates Using a CTE

Now comes the cleanup. You want to keep one row from each group and delete the rest. This is where a Common Table Expression (CTE) with ROW_NUMBER() comes in.

WITH DuplicatesCTE AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY column1, column2, column3, ..., columnN
            ORDER BY ID -- or CreatedDate, or another tie-breaker
        ) AS rn
    FROM xxyyzz
)
DELETE FROM DuplicatesCTE
WHERE rn > 1;

What this does:

  • The ROW_NUMBER() function assigns a row number to each record in each duplicate group.

  • Rows are grouped by the columns that define uniqueness.

  • The row with rn = 1 (usually the oldest or lowest ID) is kept.

  • Everything else (rn > 1) is deleted.

⚠️ Always back up your data or test in a staging environment before running deletes.
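
To make that concrete, here is the same pattern filled in with the hypothetical columns from earlier, followed by a re-run of the detection query to confirm the cleanup worked. The column names and the ID tie-breaker are assumptions; map them onto your actual schema.

-- Assumed columns (FirstName, LastName, Email) and tie-breaker (ID).
WITH DuplicatesCTE AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY FirstName, LastName, Email
            ORDER BY ID            -- keeps the lowest ID in each group
        ) AS rn
    FROM xxyyzz
)
DELETE FROM DuplicatesCTE
WHERE rn > 1;

-- Verify: this should now return zero rows.
SELECT FirstName, LastName, Email, COUNT(*) AS DuplicateCount
FROM xxyyzz
GROUP BY FirstName, LastName, Email
HAVING COUNT(*) > 1;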

✅ Optional: Prevent Future Duplicates

If this table should never have duplicates in the first place, consider enforcing a constraint with a unique index:

CREATE UNIQUE INDEX UX_xxyyzz_UniqueRows
ON xxyyzz (column1, column2, column3, ..., columnN);

This will block any future insertions that would create another duplicate.
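
If the application sometimes retries inserts and you would rather have SQL Server quietly skip the duplicate row (with a warning) instead of raising an error, the index can be created with the IGNORE_DUP_KEY option. This is a design choice rather than a requirement, and the column names below are still the hypothetical ones used above:

-- Variant: duplicate inserts are discarded with a warning rather than failing.
CREATE UNIQUE INDEX UX_xxyyzz_UniqueRows
ON xxyyzz (FirstName, LastName, Email)
WITH (IGNORE_DUP_KEY = ON);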

Final Thoughts

Cleaning duplicates is routine work, but it pays off every time. It enhances data accuracy, streamlines queries, and optimises performance. The combination of GROUP BY, ROW_NUMBER(), and a targeted DELETE gives you a clean, reliable fix.

Let this be part of your standard SQL toolkit — and if you need help customising it for a specific dataset or situation, just ask. 
