SQLServer Delete Duplicate Rows

As a “high level DBA type”, I do get to see strange things occasionally. On a previous occasion it was the duplication of records in a database. Let’s talk about how to get rid of those records.

The first type of duplicated record we will work on is one where the primary key is duplicated, whether by import or you are trying to create a unique key against existing data.

-- Lets create some test data
CREATE TABLE #table1 (
	  Id INT
	, text1 VARCHAR(50)
	, dtc DATETIME2(0)
);
INSERT INTO #table1 (
	  Id
	, text1
	, dtc
)
VALUES 
	  (1, 'test1', DATEADD(d, -10, GETDATE()))
	, (1, 'test1', DATEADD(d, -7, GETDATE()))
	, (2, 'test2', DATEADD(d, -7, GETDATE()))
	, (2, 'test2', DATEADD(d, -10, GETDATE()))

SELECT text1
FROM #table1
GROUP BY text1
HAVING COUNT(*) > 1

When executing this, you can see that 2 of our records have the same text. If you delete by ID, you will lose both records. Sadly, this is an occasion where the recommended practice is to set the ROWCOUNT to 1 and delete by ID. You will find that you can not use an ORDER BY on a DELETE statement either.

SET ROWCOUNT 1
DELETE FROM #table1 WHERE Id = 1 
DELETE FROM #table1 WHERE Id = 2
SET ROWCOUNT 0

SELECT * FROM #table1

The bad part of this is that I can not reliably control which record gets deleted based on the data.

Id	text1	dtc
1	test1	2015-04-22 12:42:15
2	test2	2015-04-19 12:42:15

The more likely scenario is you have imported data and you already have a column with an auto-increment integer for the primary key. You have duplicate records but at least you have a unique identifier.

CREATE TABLE #table2 (
	  Id INT
	, text1 VARCHAR(50)
	, dtc DATETIME2(0)
);
INSERT INTO #table2 (
	  Id
	, text1
	, dtc
)
VALUES 
	  (1, 'test1.1', DATEADD(d, -10, GETDATE()))
	, (2, 'test1.2', DATEADD(d, -7, GETDATE()))
	, (3, 'test1.1', DATEADD(d, -7, GETDATE()))
	, (4, 'test1.2', DATEADD(d, -10, GETDATE()))

SELECT text1
FROM #table2
GROUP BY text1
HAVING COUNT(*) > 1

SELECT * FROM #table2

Executing that shows the data scenario we are dealing with now. In this case we want to keep the records that were created last.

DELETE FROM #table2
WHERE Id IN(
	SELECT DISTINCT m1.Id
	FROM #table2 m1 
		LEFT JOIN #table2 m2 ON m1.text1 = m2.text1 AND m1.dtc < m2.dtc
	WHERE m2.Id IS NOT NULL
)

SELECT * FROM #table2

The above shows we can select the IDs of the records in a subquery that have the lower create date and delete those. This gives us that small amount of control over which duplicate record we want to delete.

Id	text1	dtc
2	test1.2	2015-04-22 12:52:14
3	test1.1	2015-04-22 12:52:14

And if we decide we want the first record created instead of the last, we can switch the < to >.

But wait, our data is even more messed up than normal! We have records with the same primary key but different data and we want to keep it ALL!!!

CREATE TABLE #table3 (
	  Id INT
	, text1 VARCHAR(50)
	, dtc DATETIME2(0)
);
INSERT INTO #table3 (
	  Id
	, text1
	, dtc
)
VALUES 
	  (1, 'test1.1', DATEADD(d, -10, GETDATE()))
	, (1, 'test1.2', DATEADD(d, -7, GETDATE()))
	, (2, 'test2.1', DATEADD(d, -10, GETDATE()))
	, (2, 'test2.2', DATEADD(d, -7, GETDATE()))

SELECT * FROM #table3

Oh god, it’s horrible! Fix it please! And we can. We will dump the distinct records into a holding table of sorts, TRUNCATE the messed up table, and reinsert the records with new, distinct IDs.

SELECT DISTINCT text1, dtc
INTO #holdingTable
FROM #table3

TRUNCATE TABLE #table3

INSERT INTO #table3 (Id, text1, dtc)
SELECT ROW_NUMBER() OVER(ORDER BY text1, dtc), text1, dtc
FROM #holdingTable

SELECT * FROM #table3

Id	text1	tc
1	test1.1	2015-04-19 12:55:55
2	test1.2	2015-04-22 12:55:55
3	test2.1	2015-04-19 12:55:55
4	test2.2	2015-04-22 12:55:55

How much more messed up can we let our data get?

CREATE TABLE #table4(
      Id INT
    , firstName VARCHAR(20)
    , lastName VARCHAR(20)
    , email VARCHAR(50)
    , phone VARCHAR(10)
    , dateOfBirth DATE
	, rating INT
) 
 
INSERT INTO #table4 (Id, firstName, lastName, email, phone, dateOfBirth, rating) 
VALUES 
	  (1, 'Jack', 'Smith', NULL, NULL, DATEADD(yy, -30, GETDATE()), 5)
	, (2, 'Jack', 'Smith', 'jack.smith@email.com', NULL, NULL, 6)
	, (3, 'Jack', 'Smith', NULL, '2145556666', NULL, 2)
	, (4, 'Jill', 'Jones', NULL, NULL, DATEADD(yy, -25, GETDATE()), 1)
	, (5, 'Jill', 'Jones', 'jill.jones@email.com', NULL, NULL, 10)
	, (6, 'Jill', 'Jones', NULL, '2145557777', NULL, 0)
	, (7, 'Jimmy', 'Jackson', NULL, '2146668877', DATEADD(yy, -45, GETDATE()), 6)
	, (8, 'Jimmy', 'Jackson', 'jimmmy.jackson@email.com', NULL, NULL, 6)
	, (9, 'Jimmy', 'Jackson', 'jimmy@email.com', '2145558888', DATEADD(yy, -15, GETDATE()), 5);

SELECT * FROM #table4;

Don’t ask questions you don’t want to know the answer too. Now we have data that we want to consolidate to fill in the NULL values.

WITH cte AS (
	SELECT
		  Id
		, firstName
		, lastName
		, MAX(email) OVER (PARTITION BY firstName, lastName) AS emailUpdated
		, MAX(phone) OVER (PARTITION BY firstName, lastName) AS phoneUpdated
		, MAX(dateOfBirth) OVER (PARTITION BY firstName, lastName) AS dateOfBirthUpdated
		, ROW_NUMBER() OVER (PARTITION BY firstName, lastName ORDER BY firstName, lastName) AS Rn
		, MIN(rating) OVER (PARTITION BY firstName, lastName) AS ratingUpdated
	FROM #table4
) 
SELECT
	  Id
	, firstName
	, lastName
	, emailUpdated
	, phoneUpdated
	, dateOfBirthUpdated
	, ratingUpdated
	, Rn
INTO #noDups 
FROM cte 
WHERE Rn = 1

SELECT * 
FROM #noDups 

Id	firstName	lastName	emailUpdated	phoneUpdated	dateOfBirthUpdated	ratingUpdated	Rn
1	Jack	Smith	jack.smith@email.com	2145556666	1985-04-29	2	1
4	Jill	Jones	jill.jones@email.com	2145557777	1990-04-29	0	1
7	Jimmy	Jackson	jimmy@email.com	2146668877	2000-04-29	5	1

Isn’t that cool?

As always if you want the full script, here is the link to pastebin.com.

For the record, Avengers Age of Ultron comes out tomorrow!

Written on April 29, 2015