- Data duplication during extraction happens when the same records are pulled multiple times from the source system into the target dataset.
- It usually occurs due to incorrect incremental logic or missing unique keys.
- In one ETL validation, daily orders were duplicated because the extraction filtered on order_date instead of last_updated_timestamp, so every updated order was reloaded each day.
- I identified it when total revenue suddenly doubled in the dashboard.
- We fixed it by extracting on last_updated_timestamp and deduplicating on the primary key, keeping only the latest version of each record.
- This is important because duplicates inflate KPIs such as sales, customer counts, and conversion rate.
- As a BA, I validate record counts and totals after each load to catch it early.
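The keep-the-latest-version deduplication described above can be sketched in Python. The field names (order_id, amount, last_updated) are illustrative, not from any specific system:

```python
from datetime import datetime

# Hypothetical extracted rows: order 101 appears twice because it was
# updated after the first load (field names are illustrative).
rows = [
    {"order_id": 101, "amount": 50.0, "last_updated": datetime(2026, 2, 1)},
    {"order_id": 102, "amount": 75.0, "last_updated": datetime(2026, 2, 1)},
    {"order_id": 101, "amount": 60.0, "last_updated": datetime(2026, 2, 2)},
]

def deduplicate(records, key="order_id", ts="last_updated"):
    """Keep only the most recent version of each record, by primary key."""
    latest = {}
    for r in records:
        k = r[key]
        if k not in latest or r[ts] > latest[k][ts]:
            latest[k] = r
    return list(latest.values())

clean = deduplicate(rows)
# One row per order_id; order 101 keeps only its latest version (amount 60.0).
```

The same idea applies in SQL via ROW_NUMBER() partitioned by the primary key and ordered by the timestamp, keeping row number 1.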
What is data duplication during extraction?
Updated on February 5, 2026
