Home/Articles/Data Deduplication in Large-Scale Migrations

Solving the Duplicate Dilemma: Data Deduplication in Large-Scale Migrations

How normalization, deterministic vs. fuzzy matching, survivor rules, and continuous upsert logic combine to produce a single authoritative "Golden Record" from messy, overlapping flat files.

9 min read·Data Onboarding Fundamentals

In enterprise data management, the most persistent hurdle to a successful data migration strategy is the proliferation of redundant information. When organizations prepare for a high-volume bulk data import, they pull from multiple legacy sources — almost all of which export via CSV file. Because a comma-separated values file is untyped and carries no unique constraints, duplicates accumulate silently over time across every system that accepts flat-file input.

The "Duplicate Dilemma" is this: the same real-world entity appears multiple times in your flat files with slight variations. A regional sales office CSV lists a client as Acme Corp, Inc. — the marketing department's flat file calls the same company Acme Corporation. When combined for a bulk upload, the target system treats them as two distinct customers. This fragments analytics, corrupts reporting, and breaks MDM master data management at the foundation. A distorted view of the customer count leads to wasted marketing spend, duplicate outreach, and business intelligence that cannot be trusted.

Solving it requires a four-step workflow. For a format primer, see what is a CSV file.

Step 1: Normalization as a Prerequisite for Deduplication

You cannot identify duplicates in data that is not standardized. Normalization must happen first — before any matching logic runs — because two records that represent the same entity will only match if their fields are in the same canonical format:

# Same city, three representations — will NOT match without normalization
Location
New York, NY
NYC
new york

# Normalized — all three collapse to one canonical value
Location
New York
New York
New York

# Same phone number, two representations — will NOT match
phone
(555) 123-4567
5551234567

# Normalized — strip all non-digit characters first
phone
5551234567
5551234567

Atomic parsing and format standardization create the level playing field where matching can actually occur. Without this step, fuzzy matching will miss obvious duplicates and flag false positives — making every subsequent step unreliable. For the full normalization toolkit, see Data Normalization: Raw CSVs into Clean Records.

Step 2: Deterministic vs. Probabilistic Matching

With normalized data, two matching strategies are available. They are not mutually exclusive — both should run in sequence:

StrategyHow it worksWhen to use
DeterministicExact match on a unique identifier (SSN, UPC, EIN, email)When reliable unique keys exist across source files
Probabilistic (fuzzy)Confidence score across multiple fields (name, address, phone)When unique keys are missing, corrupted, or inconsistent

Deterministic matching handles the easy cases. Probabilistic matching handles the hard ones — the scenario where Johnathan Smith at 123 Main St and John Smith at 123 Main Street are the same person. Fuzzy algorithms compute a confidence score across the name similarity (high), address similarity (high), and any other overlapping fields to surface a likely match. Records that exceed a confidence threshold are flagged for merge; those in a middle range are queued for human review.

Record A:  Johnathan Smith  |  123 Main St    |  555-0199
Record B:  John Smith       |  123 Main Street |  555-0199

Name similarity score:   0.87  (Levenshtein / Jaro-Winkler)
Address similarity:      0.95  (abbreviation-aware comparison)
Phone match:             1.00  (exact after normalization)

Composite score: 0.94  →  MERGE (above 0.90 threshold)

This produces a "Golden Record" — a single authoritative version that merges the most accurate fields from both source rows. In MDM master data management, the Golden Record is the core deliverable of the deduplication workflow. For the AI-assisted mapping that integrates with this process, see AI-Powered Data Mapping.

Step 3: Staging, Survivor Rules, and Cross-File Analysis

Deduplication must never run directly in the production environment. The data from raw CSV files moves into a staging area first — a temporary environment where cross-file analysis can occur before any production write. When importing three CSV files that all contain lead data, the staging area acts as a filter that identifies duplicates across all three files simultaneously before the final load executes.

When the staging layer identifies a duplicate pair, a Survivor Rule determines which version of each field to retain in the merged Golden Record:

-- Example survivor rule logic in a staging merge
UPDATE staging.golden_customers gc
SET
    -- Keep the more recent address
    address     = CASE
                    WHEN a.last_updated > b.last_updated THEN a.address
                    ELSE b.address
                  END,
    -- Keep the non-null email (prefer Record A if both present)
    email       = COALESCE(a.email, b.email),
    -- Sum loyalty points from both records
    loyalty_pts = COALESCE(a.loyalty_pts, 0) + COALESCE(b.loyalty_pts, 0)
FROM staging.record_a a
JOIN staging.record_b b ON b.match_id = a.match_id
WHERE gc.id = a.id;

Survivor rules encode business logic: the most recent address wins, non-null values are preferred, and additive fields like loyalty points are summed. This ensures the final record entering the normalized database is not only unique but the highest-quality version available across all sources. For the validation patterns that run alongside this, see Advanced Data Validation Strategies.

Step 4: Continuous Validation and Upsert Logic

Deduplication is not a one-time migration task — it must be integrated into the ongoing MDM master data management workflow. New duplicates enter the system continuously through manual entry, new bulk uploads, and third-party integrations. The validation layer must stay active after the initial migration.

Every new CSV import should trigger an upsert check against existing production records — update if the record already exists, insert only if it is genuinely new:

-- Postgres upsert: update existing record, skip insert if duplicate
INSERT INTO customers (email, first_name, last_name, address, updated_at)
VALUES ($1, $2, $3, $4, NOW())
ON CONFLICT (email)
DO UPDATE SET
    first_name = EXCLUDED.first_name,
    last_name  = EXCLUDED.last_name,
    address    = EXCLUDED.address,
    updated_at = NOW()
WHERE customers.updated_at < EXCLUDED.updated_at;
    -- Only update if the incoming record is newer

This ON CONFLICT ... DO UPDATE pattern (SQL Server equivalent: MERGE) ensures that re-importing a CSV with updated customer addresses enriches the existing record rather than duplicating it. Combined with an active probabilistic matching layer that runs on every new import, the system maintains a continuously clean, non-redundant database without requiring periodic manual deduplication audits.

Solving the duplicate dilemma is ultimately about the reliability of business intelligence. A migration that fails to deduplicate produces a database that provides a distorted view of reality — skewed customer counts, inflated pipeline metrics, and MDM records that cannot serve as a trusted single source of truth. See Data Cleansing vs. Data Scrubbing for the upstream preparation that reduces the volume of duplicates reaching the matching pipeline in the first place.

Import once, deduplicate automatically

Elvity runs normalization, deterministic matching, and upsert logic on every CSV import — so every new file enriches your existing records rather than fragmenting them.