obstkel.com logo

SQL Find Duplicates Like a Pro: 3 Guaranteed Techniques

3 dogs representing duplicate records in sql

Aah…. duplicates! They are everywhere! Look around you – multiple charger cables, headphones, pictures in your smartphone! But we are not here to talk about those duplicates. No, Sir! We are here to address the duplicates in sql, how to find them and possibly resolve them in your SQL code.

In this SQL find duplicates post, let us look at 3 ways to identify duplicate rows/columns and then conclude by looking at 2 ways to mitigate them.

  1. Using Count
  2. MINUS Function
  3. Analytic Functions


Let us start by looking at a very simple database table, USER_DIET. The below listed table shows the Fruit consumption of Sam and John over two days.

Just by looking at the data can you tell if there are duplicates in the table, say for the column “NAME”?

NAMEFRUITDAY
JohnAppleMonday
SamOrangeMonday
JohnOrangeTuesday
SamBananaTuesday
JohnPeachWednesday
SamBananaWednesday

The most obvious answer is YES! John occurs 3 times and so does Sam.

How about if we were to look at columns NAME and FRUIT? Once again, the answer would be YES, because “Sam” and “Banana” occurs twice. Apparently, Sam loves bananas, while John prefers a different fruit every day.

Finally, let’s look at columns NAME, FRUIT and DAY. Do you see any duplicates now?

The answer is NO. There are no duplicates because both Sam and John had a different fruit on each day.

The point I would like to drive home is this! To truly understand if data is duplicate, you need to understand the context and the functionality behind it.


Note
: All SQL examples below use Oracle SQL syntax. However, they should work across most relational databases with minimal changes.


Related post: Apache Spark SQL date functions

1. SQL Find Duplicates using Count

The most common method to find duplicates in sql is using the count function in a select statement. There are two other clauses that are key to finding duplicates: GROUP BY and HAVING.

Let us continue using the database table (USER_DIET) from the previous example and see if we can find duplicates for the NAME column.

a. Duplicates in a single column

SELECT name,count(*)
FROM user_diet
GROUP BY name
HAVING count(*)>1;
Output from SQL statement: 
NAME COUNT(*)
John 3
Sam 3

In this second example, let us look at finding duplicates in multiple columns: NAME and FRUIT. 

Lets think this thru and put things in context before diving into our select statement. As yourself, what am I trying to find here ? 

We are trying to find if any of the users, in this case, Sam/John had the same fruit twice. That it ! This context is based on the two fields NAME and FRUIT. 

 

b. Duplicates in multiple columns

SELECT name, fruit, count(*)
FROM user_diet
GROUP BY name, fruit
HAVING count(*)>1;
Output from SQL statement: 
NAME FRUIT COUNT(*)
Sam Banana 2

 

Key to remember, the columns in the select statement, excluding the count(*) should be the exact same in the group by clause as well. 

Also note that using the count(*) function gives you a count of the number of occurrences of a value. In this case, “Sam” + “Banana” occurs twice in the table, but in actuality we only have one duplicate row.

c. SQL to find duplicate rows

The SQL to find duplicate rows in a table is not the same as checking for duplicates in a column.

Ideally, if the database table has the right combination of key columns, you should not have duplicate rows. Regardless, if you are suspicious that your table has duplicate rows, perform the below steps.

  1. Determine they Key columns on your table.
  2. If the table does not have keys defined, determine which column(s) makes a row unique. Often times this depends on the functional use case of the data.
  3. Add the fields from Step 1 or Step 2 to your SQL COUNT(*) clause. 

Using the USER_DIET table above, lets assume no keys were defined on the table. Our next option would be determining which column(s) makes a row unique.

Note that the table has 3 rows. If Sam or Jon had the same fruit more than once on the same day, this would create a duplicate row.

Could Sam or Jon eating different fruits on the same day be considered a duplicate row?

The answer – Maybe!  It depends on the functional use case of the data.

The SQL to find duplicate rows syntax is as shown below.

SELECT name, fruit, day, count(*) from user_diet
GROUP BY name, fruit, day
HAVING count(*)>1;

2. SQL Find Duplicates using MINUS function

The MINUS function works on two tables ( or datasets) and returns rows from the first table that does not belong in the second table. This option using the MINUS function in SQL, to find duplicates, is specific to Oracle. Use it for awareness and to validate your results using the count(*) method. 

Find duplicates using MINUS function and rowid

SELECT name, rowid FROM user_diet
MINUS
SELECT name, MIN(rowid) FROM user_diet
GROUP BY name; 
Output from SQL statement: 
NAME COUNT(*)
Sam 2

 

ROWID is a pseudo column in Oracle and contains a distinct ID for each row in a table.

The first select statement (before the MINUS function) returns 6 rows containing NAME and a distinct value for the ROWID column.  The second select statement on the other hand returns 2 rows, one for Sam and one for John. Why do you think that is ?

It’s because of the min function on the ROWID column.

The final output contains the “actual” number of duplicate rows, and not the total number of rows like the count(*) function.

Find duplicates using MINUS function and rownum

SELECT name, rownum FROM user_diet 
MINUS
SELECT name, rownum FROM
(SELECT DISTINCT name FROM user_diet);
Output from SQL statement: 
NAME COUNT(*)
Sam 2


In this second example, we used ROWNUM, which is a pseudo column used to uniquely identify the order of each row in a select statement. 

So, what’s the difference between ROWNUM and ROWID in our example? 

They are both pseudo columns in Oracle.

ROWNUM is a number and is generated on the result of the SQL statement. ROWID on the other hand is associated with each row of a table. 

3. Find Duplicates in SQL using Analytic functions

Analytic functions are used to perform calculations on a grouping of data, normally called a “window”. This technique can be a bit confusing if you are just starting off with SQL, but it’s definitely worth knowing.

SELECT name, ROW_NUMBER() OVER ( PARTITION BY ssn ORDER BY ssn) AS rnum 
FROM user_diet;






Output from SQL statement: 
NAME RNUM
John 1
John 2
John 3
Sam 1
Sam 2
Sam 3

What are we doing here?

We are attempting to find if any duplicates exist for the column NAME.

Let’s break down this SQL and make sense of it. 

The function ROW_NUMBER() assigns a number starting at 1 to the rows returned by the PARTITION window.

In our case, since we partitioned our dataset on the NAME column, we have 2 datasets: one for Sam and one for John. ROW_NUMBER() now assigns a unique number to each of the 3 rows for Sam, resets the counter and then does the same for John.

The resulting output is as shown on the right side of the query.

One of the reasons I love this technique is because I can turn the above SQL into a nested subquery and get a distinct set of records as shown below.

SELECT name FROM (
SELECT
 name, ROW_NUMBER() OVER ( PARTITION BY name ORDER BY name) AS rnum FROM user_diet)
WHERE rnum = 1;

Conclusion

A final tidbit, SQL is not limited to transactional databases.

Apache Spark has a module called Spark SQL to handle structured data. AWS Athena even lets you write SQL against files!

The demand for SQL skills is endless. So play around with what you learned here. Try selecting multiple columns, switch the PARTITIONS, change the SORT order. Practice is the best way to master something !

SQL helpful links

Oracle SQL Reference

Oracle SQL syntax, expressions and function reference

MySQL Reference Manual

For MySQL related syntax, statements and examples

T-SQL Reference

For SQL Server transact SQL functions, examples and syntax

Table of Contents

Interested in our services ?

email us at : info@obstkel.com

Copyright 2022 © OBSTKEL LLC. All rights Reserved