SQL Find Duplicates Like a Pro: 3 Guaranteed Techniques

dogs representing duplicate records

Aah…. duplicates! They are everywhere! Look around you – multiple charger cables, headphones, pictures in your smartphone! But we are not here to talk about those duplicates. No, Sir! We are here to address the duplicates in databases, how to find them and possibly resolve them in your SQL code.

In this SQL find duplicates post, let us look at 3 ways to identify duplicate rows/columns and then conclude by looking at 2 ways to mitigate them.

  1. SQL Find Duplicates using Count
  2. SQL Find Duplicates using MINUS Function
  3. SQL Find Duplicates using Analytic Functions

Let us start by looking at a very simple database table, USER_DIET. The below listed table shows the Fruit consumption of Sam and John over two days. Just by looking at the data can you tell if there are duplicates in the table, say for the column “NAME”?

NAMEFRUITDAY
JohnAppleMonday
SamOrangeMonday
JohnOrangeTuesday
SamBananaTuesday
JohnPeachWednesday
SamBananaWednesday

The most obvious answer is YES! John occurs 3 times and so does Sam.

How about if we were to look at columns NAME and FRUIT? Once again, the answer would be YES, because “Sam” and “Banana” occurs twice. Apparently, Sam loves bananas, while John prefers a different fruit every day.

Finally, lets look at columns NAME, FRUIT and DAY. Do you see any duplicates now?

The answer is NO. There are no duplicates because both Sam and John had a different fruit on each day.

The point I would like to drive home is this! To truly understand if data is duplicate, you need to understand the context and the functionality behind it.

1. SQL Find Duplicates using Count

The most common method to find duplicates is using the count function in a select statement. There are two other clauses that are key to finding duplicates: GROUP BY and HAVING.

Let us continue using the database table (USER_DIET) from the previous example and see if we can find duplicates for the NAME column.

a. SQL find duplicates in a column

SELECT name,count(*)
FROM user_diet
GROUP BY name
HAVING count(*)>1;
Output from SQL statement: 
NAME COUNT(*)
John 3
Sam 3

In this second example, let us look at finding duplicates in multiple columns: NAME and FRUIT. 

Lets think this thru and put things in context before diving into our select statement. As yourself, what am I trying to find here ? 

We are trying to find if any of the users, in this case, Sam/John had the same fruit twice. That it ! This context is based on the two fields NAME and FRUIT. 

b. SQL find duplicates in multiple columns

SELECT name, fruit, count(*)
FROM user_diet
GROUP BY name, fruit
HAVING count(*)>1;
Output from SQL statement: 
NAME FRUIT COUNT(*)
Sam Banana 2

 

Key to remember, the columns in the select statement, excluding the count(*) should be the exact same in the group by clause as well. 

Also note that using the count(*) function gives you a count of the number of occurrences of a value. In this case, “Sam” + “Banana” occurs twice in the table, but in actuality we only have one duplicate row.

2. SQL Find Duplicates using MINUS function

The MINUS function works on two tables ( or datasets) and returns rows from the first table that does not belong in the second table. This option using the MINUS function in SQL, to find duplicates, is specific to Oracle. Use it for awareness and to validate your results using the count(*) method. 

SQL to find duplicates using MINUS function and rowid

SELECT name, rowid FROM user_diet
MINUS
SELECT name, MIN(rowid) FROM user_diet
GROUP BY name; 
Output from SQL statement: 
NAME COUNT(*)
Sam 2

ROWID is a pseudo column in Oracle and contains a distinct ID for each row in a table.

The first select statement (before the MINUS function) returns 6 rows containing NAME and a distinct value for the ROWID column.  The second select statement on the other hand returns 2 rows, one for Sam and one for John. Why do you think that is ?

It’s because of the min function on the ROWID column.

The final output contains the “actual” number of duplicate rows, and not the total number of rows like the count(*) function.

SQL to find duplicates using MINUS function and rownum

SELECT name, rownum FROM user_diet 
MINUS
SELECT name, rownum FROM
(SELECT DISTINCT name FROM user_diet);
Output from SQL statement: 
NAME COUNT(*)
Sam 2

In this second example, we used ROWNUM, which is a pseudo column used to uniquely identify the order of each row in a select statement. 

So what’s the difference between ROWNUM and ROWID in our example ? 

They are both pseudocolumns in Oracle. ROWNUM is a number and is generated on the result of the SQL statement. ROWID on the other hand is associated with each row of a table. 

3. SQL Find Duplicates using Analytic functions

Analytics functions are used to perform calculations on a grouping of data, normally called a “window”. This technique can be a bit confusing if you are just starting off with SQL, but its definitely worth knowing.



SELECT
name, ROW_NUMBER() OVER ( PARTITION BY ssn ORDER BY ssn) AS rnum
FROM user_diet;




Output from SQL statement: 
NAME RNUM
John 1
John 2
John 3
Sam 1
Sam 2
Sam 3

What are we doing here ?

We are attempting to find if any duplicates exist for the column NAME. Lets see if we can break this SQL down and make sense of it. 

The function ROW_NUMBER() assigns a number starting at 1 to the rows returned by the PARTITION window. In our case, since we partitioned our dataset on the NAME column, we have 2 datasets: one for Sam and one for John. ROW_NUMBER() now assigns a unique number to each of the 3 rows for Sam, resets the counter and then does the same for John. The resulting output is as shown on the right side of the query.

One of the reasons I love this technique is because I can turn the above SQL into a nested subquery and get a distinct set of records as shown below.

SELECT name FROM (
SELECT
 name, ROW_NUMBER() OVER ( PARTITION BY name ORDER BY name) AS rnum FROM user_diet)
WHERE rnum = 1;

 

A final tidbit, SQL is not limited to transactional databases. Apache Spark has a module called Spark SQL to handle structured data. AWS Athena even lets you write SQL against files ! The demand for SQL skills is endless. So play around with what you learned here. Try selecting multiple columns, switch the PARTITIONS, change the SORT order. Practice is the best way to master something !

SQL helpful links

Oracle Database SQL Reference

For Oracle SQL syntax, expressions and function reference

Transact SQL (T-SQL) Reference

For SQL Server transact SQL functions, examples and syntax

MySQL Reference Manual

For MySQL related syntax, statements and examples

PostgreSQL Language Reference

For information related to PostgreSQL data types and functions

Email us at : info@obstkel.com

Copyright 2020 © OBSTKEL LLC. All rights Reserved.
Scroll to Top