PostgreSQL Tutorial
Data Types
Querying & Filtering Data
Managing Tables
Modifying Data
Conditionals
Control Flow
Transactions & Constraints
Working with JOINS & Schemas
Roles & Permissions
Working with Sets
Subquery & CTEs
User-defined Functions
Important In-Built Functions
PostgreSQL PL/pgSQL
Variables & Constants
Stored Procedures
Working with Triggers
Working with Views & Indexes
Errors & Exception Handling
Deleting duplicate rows in PostgreSQL using a subquery involves identifying the duplicates based on certain criteria and then eliminating the unwanted rows. Let's consider a simple example to demonstrate the process.
Suppose we have a table named students
with columns id
, name
, and age
, and there might be multiple rows with the same name
and age
, but different id
values. We want to keep one unique row for each combination of name
and age
and delete the duplicates.
Here's how you can achieve this:
Using a Common Table Expression (WITH clause):
WITH duplicates AS ( SELECT name, age, ROW_NUMBER() OVER (PARTITION BY name, age ORDER BY id) AS row_num FROM students ) DELETE FROM students WHERE id IN (SELECT id FROM duplicates WHERE row_num > 1);
In this method, we use the ROW_NUMBER()
window function to assign a unique number to each row within the partitions of name
and age
. Any row number greater than 1 indicates a duplicate.
Using a Subquery without a CTE:
DELETE FROM students WHERE id NOT IN ( SELECT DISTINCT ON (name, age) id FROM students ORDER BY name, age, id );
Here, DISTINCT ON
selects one id
for each combination of name
and age
, prioritizing the lowest id
value because of the ORDER BY
clause. The outer query then deletes all rows that don't have those selected id
values, thus removing duplicates.
Using JOIN with a Subquery:
This method involves joining the table with a subquery that identifies the rows to be retained:
DELETE FROM students s1 USING ( SELECT name, age, MAX(id) as max_id FROM students GROUP BY name, age HAVING COUNT(*) > 1 ) s2 WHERE s1.name = s2.name AND s1.age = s2.age AND s1.id <> s2.max_id;
In the subquery, we identify each combination of name
and age
that occurs more than once and select the maximum id
for each group. The outer DELETE query then deletes all rows for these groups except the ones with the maximum id
.
Before running the DELETE statement, it's always a good idea to run a SELECT statement using the same criteria to verify that you're targeting the correct rows. Also, make sure you have backups or use transactions to safeguard against unintended data deletions.
Deleting duplicate rows with a subquery in PostgreSQL:
DELETE FROM your_table WHERE id NOT IN ( SELECT MIN(id) FROM your_table GROUP BY duplicate_column );
Using subqueries to identify and delete duplicate records in PostgreSQL:
DELETE FROM your_table WHERE id IN ( SELECT id FROM your_table GROUP BY duplicate_column HAVING COUNT(*) > 1 );
Finding and removing duplicates with correlated subqueries in PostgreSQL:
DELETE FROM your_table t1 WHERE EXISTS ( SELECT 1 FROM your_table t2 WHERE t1.duplicate_column = t2.duplicate_column AND t1.id < t2.id );
Deleting duplicates based on specific columns in PostgreSQL:
DELETE FROM your_table WHERE (column1, column2) IN ( SELECT column1, column2 FROM your_table GROUP BY column1, column2 HAVING COUNT(*) > 1 );
Handling NULL values when deleting duplicates in PostgreSQL:
DELETE FROM your_table WHERE (column1, column2) IN ( SELECT column1, column2 FROM your_table WHERE column1 IS NOT NULL AND column2 IS NOT NULL GROUP BY column1, column2 HAVING COUNT(*) > 1 );
Deleting duplicates from multiple tables with a subquery in PostgreSQL:
DELETE FROM table1 WHERE (column1, column2) IN ( SELECT column1, column2 FROM table1 JOIN table2 ON table1.id = table2.id GROUP BY column1, column2 HAVING COUNT(*) > 1 );
Using EXISTS in subqueries for deleting duplicates in PostgreSQL:
DELETE FROM your_table t1 WHERE EXISTS ( SELECT 1 FROM your_table t2 WHERE t1.duplicate_column = t2.duplicate_column AND t1.id < t2.id );
Avoiding common pitfalls when deleting duplicates with a subquery in PostgreSQL:
-- Check for NULL values and use appropriate conditions to avoid unintended removal