The DISTINCT keyword in SQL is used to eliminate duplicate values from the result set, ensuring that only unique records are returned. Itβs commonly applied when we want to find all the different values in a column or combination of columns.
For example, consider a customers table:
| customer_id | city |
|---|---|
| 1 | Chennai |
| 2 | Bangalore |
| 3 | Chennai |
If we want to get the unique cities where customers are located, we can use:
SELECT DISTINCT city
FROM customers;
Result:
| city |
|---|
| Chennai |
| Bangalore |
We can also use DISTINCT on multiple columns:
SELECT DISTINCT city, country
FROM customers;
This returns unique combinations of city and country.
In a real-world scenario, I used DISTINCT while preparing a report of unique products sold across different regions. Initially, the raw sales table had multiple entries for the same product in the same region. Using DISTINCT simplified the dataset and made the reporting more readable.
A challenge Iβve faced is that DISTINCT considers all selected columns together. So, if even one column in the row differs, it treats the row as unique. This sometimes led to unexpected results when I only cared about one column but included others in the SELECT clause. The solution was either to select only the relevant column or use GROUP BY for aggregation.
A limitation of DISTINCT is performance β on large tables, removing duplicates requires additional sorting or hashing, which can slow down queries.
As an alternative, for large datasets where performance is critical, I sometimes use GROUP BY instead of DISTINCT:
SELECT city
FROM customers
GROUP BY city;
This achieves the same result and can be optimized by the database more efficiently in some cases.
So, in summary β DISTINCT is a simple and effective way to retrieve unique records, especially useful for analysis, reporting, and eliminating duplicates.
