The easy way to find and delete duplicates in SQL

Next in my series of helpful SQL code: duplicates. I always seem to be having issues with duplicates, which annoys me immensely because all of the issues come from other people's data that I have to work with.

No problem, I hear you say; Google is awash with ways to delete them. Trouble is, these are all based on blindly deleting superfluous rows, without caring how they got there or applying any specific criteria for which rows to delete.

Identifying duplicates

First up, you have to identify the duplicates in order to understand why they are duplicated in the first place. Using this nifty bit of code we can pull up a table containing only the duplicates:

SELECT *
FROM [#Table1]
WHERE [Person] IN (
    SELECT [Person]
    FROM [#Table1]
    GROUP BY [Person]
    HAVING COUNT(*) > 1
)
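If you want to try the query above for yourself, here is a minimal sample table to run it against. The table and column names match the examples in this post; the data itself is made up for illustration:

```sql
-- Hypothetical sample data matching the examples in this post
CREATE TABLE [#Table1] (
    [AccNo] INT,
    [Person] VARCHAR(50),
    [Bikes] INT
);

INSERT INTO [#Table1] ([AccNo], [Person], [Bikes])
VALUES (1, 'Alice', 2),
       (2, 'Bob', 1),
       (2, 'Bob', 1),   -- pure duplicate row
       (3, 'Carol', 3);
```

With this data, the duplicate-finding query returns both 'Bob' rows, since 'Bob' appears more than once in [Person].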

Deleting rows

Having checked the duplicates, we can proceed to getting rid of them. If the entries are pure duplicates and we just want to delete the extras, we have two ways of proceeding. We can either

  • delete the duplicates in situ, or
  • filter out the duplicates in a query

Delete duplicates from a table

Firstly we add an automatically incrementing number field to the table, then delete every record that is not the first occurrence of each item in the selected field, and finally tidy up by dropping the helper column:

ALTER TABLE [#Table1] ADD AUTOID INT IDENTITY(1,1)
DELETE FROM [#Table1] WHERE AUTOID NOT IN (SELECT MIN(AUTOID) FROM [#Table1] GROUP BY [Person])
ALTER TABLE [#Table1] DROP COLUMN AUTOID

Nice and straightforward.

Filter out duplicates in a query

The trick with this one is that you just have to group by all the fields you want to de-duplicate, though this is invariably all of them!

SELECT AccNo, Person, Bikes FROM #Table1 GROUP BY AccNo, Person, Bikes

The caveat is that you need to know which fields are duplicated in the first place.
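When the rows are exact duplicates across every column, SELECT DISTINCT does the same job as grouping by all of them, and arguably states the intent more clearly:

```sql
-- Equivalent to grouping by every column when rows are exact duplicates
SELECT DISTINCT AccNo, Person, Bikes
FROM #Table1
```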

Smart deletion of duplicate records

This is the scenario where we've looked at the duplicates and decided the course of action needed to weed them out, usually removing the rows with the oldest date or the lowest values. In that case this is the solution you need: sort the dataset into a new table, in the order that puts the rows you want to keep first, then go ahead and delete the superfluous rows:

SELECT * INTO #Table2 FROM #Table1 ORDER BY [Date1] DESC, [Field1] ASC
ALTER TABLE #Table2 ADD AUTOID INT IDENTITY(1,1)
DELETE FROM #Table2 WHERE AUTOID NOT IN (SELECT MIN(AUTOID) FROM #Table2 GROUP BY [Person])
ALTER TABLE #Table2 DROP COLUMN AUTOID
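On SQL Server 2005 and later, the same keep-the-newest logic can be done in place, without the sort-into-a-new-table step, using ROW_NUMBER() in a common table expression. This is a sketch assuming the same column names as above:

```sql
-- Number each person's rows, newest [Date1] first, then delete
-- everything after the first row in each group
;WITH Ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY [Person]
               ORDER BY [Date1] DESC, [Field1] ASC
           ) AS rn
    FROM #Table1
)
DELETE FROM Ranked WHERE rn > 1;
```

Deleting through the CTE removes the rows from the underlying #Table1 directly, so no temporary copy or helper column is needed.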

Now all we need to do is get people to check their data for duplicates before they send it, to save us hours of work!

