Handling duplicate rows in SQL depends on whether you want to find, remove, or prevent them. I usually start by identifying duplicates, then decide whether to keep the first occurrence, aggregate them, or delete them based on the use case.
To find duplicate rows, I group by the columns that define uniqueness and use the HAVING clause:
SELECT EmployeeName, DepartmentID, COUNT(*) AS DuplicateCount
FROM Employees
GROUP BY EmployeeName, DepartmentID
HAVING COUNT(*) > 1;
This shows which combinations appear more than once.
To delete duplicates while keeping one copy, one common approach is using a CTE with ROW_NUMBER():
WITH DuplicateCTE AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY EmployeeName, DepartmentID ORDER BY EmployeeID) AS row_num
    FROM Employees
)
DELETE FROM DuplicateCTE WHERE row_num > 1;
Here, ROW_NUMBER() assigns a sequential number within each duplicate group, and rows numbered greater than 1 are deleted, leaving exactly one record per group. Note that deleting through a CTE like this works in SQL Server; other engines (e.g. PostgreSQL) require a different form, such as deleting by key with a subquery.
I used this technique in a data warehouse project where data came from multiple source systems and duplicate customer records were frequently inserted. The CTE approach let us clean up large tables quickly and safely before running ETL transformations.
One challenge I faced was with tables of millions of rows: a single bulk delete could lock the table or slow down other transactions. To mitigate this, I ran the deletions in batches, committing each batch in its own transaction (BEGIN TRAN ... COMMIT), or staged the surviving rows in a temporary table first.
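To illustrate the batching idea, here is a runnable sketch. It is not from the original answer: it uses Python's sqlite3 module as a stand-in for the production database (SQLite syntax differs from SQL Server, so there is no BEGIN TRAN / DELETE TOP; instead each pass deletes a LIMIT-ed set of duplicate keys and commits). The table and column names follow the examples above; the sample data is invented.

```python
import sqlite3

# Build a toy Employees table with heavy duplication (98 of 100 rows are copies).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Employees (
    EmployeeID   INTEGER PRIMARY KEY,
    EmployeeName TEXT,
    DepartmentID INTEGER)""")
rows = [("Alice", 1), ("Bob", 2)] * 50
conn.executemany(
    "INSERT INTO Employees (EmployeeName, DepartmentID) VALUES (?, ?)", rows)
conn.commit()

BATCH = 10  # rows deleted per transaction; small batches keep locks short-lived
while True:
    # Delete at most BATCH duplicates per pass, keeping the lowest EmployeeID
    # in each (EmployeeName, DepartmentID) group, then commit the batch.
    cur = conn.execute("""
        DELETE FROM Employees WHERE EmployeeID IN (
            SELECT EmployeeID FROM (
                SELECT EmployeeID,
                       ROW_NUMBER() OVER (
                           PARTITION BY EmployeeName, DepartmentID
                           ORDER BY EmployeeID) AS row_num
                FROM Employees)
            WHERE row_num > 1
            LIMIT ?)""", (BATCH,))
    conn.commit()            # releases locks between batches
    if cur.rowcount == 0:    # no duplicates left
        break

remaining = conn.execute("SELECT COUNT(*) FROM Employees").fetchone()[0]
print(remaining)  # 2: one row per distinct (name, department) pair
```

The loop re-runs the ROW_NUMBER() query each pass, so it is safe to interrupt and resume: whatever duplicates remain are found again on the next pass.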
A limitation is that if the columns in your PARTITION BY (or GROUP BY) don't truly define uniqueness, you can accidentally delete legitimate records that merely look similar.
As an alternative, when deleting isn't appropriate, I sometimes use SELECT DISTINCT to suppress duplicates at the query-output level without touching the stored data:
SELECT DISTINCT EmployeeName, DepartmentID FROM Employees;
And to prevent duplicates in the future, I apply a PRIMARY KEY or UNIQUE constraint (or a unique index) on the relevant columns, which enforces data integrity at the database level.
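The constraint approach can be sketched as follows. Again this is not from the original answer: it uses Python's sqlite3 as a stand-in database, with the table and column names borrowed from the examples above and the sample data invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNIQUE (EmployeeName, DepartmentID) makes the database itself reject
# duplicate pairs at insert time, instead of cleaning them up later.
conn.execute("""CREATE TABLE Employees (
    EmployeeID   INTEGER PRIMARY KEY,
    EmployeeName TEXT NOT NULL,
    DepartmentID INTEGER NOT NULL,
    UNIQUE (EmployeeName, DepartmentID))""")

conn.execute(
    "INSERT INTO Employees (EmployeeName, DepartmentID) VALUES ('Alice', 1)")

duplicate_rejected = False
try:
    # Second insert of the same (name, department) pair violates the constraint.
    conn.execute(
        "INSERT INTO Employees (EmployeeName, DepartmentID) VALUES ('Alice', 1)")
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True: the duplicate never reached the table
```

The same idea applies in any engine: once the constraint exists, application code handles the constraint-violation error rather than deduplicating after the fact.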
