Cleaning up duplicate records before they cost you

Duplicate records are the most expensive data problem that no one talks about. They do not cause a visible system error. They cause invisible, compounding problems: a donor receives the same appeal twice and marks your email as spam; a volunteer shows as “not checked in” because they were checked in on a second record; your headcount report says 1,400 active members when the real number is 1,100. None of these failures announce themselves. They just quietly erode the trust of the people in your database and the accuracy of your decisions.

How duplicates form in the first place

Duplicates are not usually the result of carelessness. They are the result of systems that have no deduplication logic and multiple entry points. Understanding the source tells you where to put the prevention.

Bulk imports without a match key. A spreadsheet import that checks for duplicates by exact name match will create new records for “John Smith” and “John A. Smith” even when they are the same person. Without email as a match key, imports are duplicate factories.
Web forms with no existing-record check. Every time someone fills out an event registration, a giving form, or a contact form, a new record is created if the system does not check first whether that email already exists.
Manual entry by multiple staff members. Staff member A adds a new attendee on Sunday morning. Staff member B does not check and adds the same person Monday afternoon with a slightly different spelling.
System migrations and integrations. When you move from one platform to another, or when two systems sync, records that were unique in each system become duplicates in the merged dataset.
Name changes. A person gets married, changes their name, and a well-meaning staff member creates a new record for the new name rather than updating the existing one.

Finding duplicates: the fuzzy match approach

Exact-match deduplication finds the obvious cases (identical name and email). Fuzzy matching finds the real problem: records that represent the same person but differ by a typo, a missing middle initial, or a maiden name. Here is how to run an effective fuzzy match.

Start with email as the primary key Run a report of all email addresses that appear on more than one record. These are your most certain duplicates. Email is the most reliable identifier because most people have one primary address they use consistently.
Match on phone number as the secondary key After email, mobile phone numbers are the second most reliable identifier. Run the same report for phone numbers appearing on multiple records. Strip formatting first (remove spaces, dashes, parentheses) so “555-123-4567” and “5551234567” match.
Run a name-plus-address fuzzy match For records with no email or phone overlap, match on a combination of last name (exact) plus mailing address (fuzzy: same street number and ZIP). This catches records from the same household that have different email addresses.
Flag for human review, do not auto-merge Fuzzy match produces candidates, not certainties. A list of “likely duplicate” pairs should go to a staff review queue. A human confirms each pair before anything is merged. Auto-merging fuzzy matches will eventually merge two different people and corrupt both records.

Before and after: a merge example

Before merging, both records contain partial truth. After merging, one record contains the best available data from both. Here is what that looks like in practice.

Field	Record A (older)	Record B (newer)	After merge (keep)
Name	Maria Reyes	Maria E. Reyes	Maria E. Reyes (more complete)
Email	maria@gmail.com	mariareyes@work.com	Both (primary: gmail)
Phone	555-210-0044	(blank)	555-210-0044
Address	412 Oak St, 78201	(blank)	412 Oak St, 78201
Giving history	3 gifts totaling $450	(blank)	Retained on merged record
Last interaction	2023-11-14	2024-02-03	2024-02-03 (most recent)
Created date	2021-04-22	2024-02-03	2021-04-22 (original, older)

The rule for every field: keep the more complete value. For dates, keep the oldest created date (it reflects when they first entered your system) and the most recent interaction date (it reflects their current engagement). For transactional history (giving, attendance, event registrations), merge everything onto the surviving record. Never discard transaction history.

Merge rules: what to keep, what to discard

Before you run a merge, agree on the rules. Inconsistent merges create new data problems. Document these rules and apply them the same way every time.

Always keep the older “created date.” The surviving record should reflect when the person first entered your system, not when the duplicate was created.
Always merge all transaction history to the surviving record. Giving, attendance, event registrations, notes, communication history. None of it should be discarded.
For conflicting contact info, keep the more recently updated value. If one record has an address updated in 2024 and the other has one from 2020, keep 2024.
For conflicting names, keep the more complete version. “Maria E. Reyes” over “Maria Reyes.” Add the variant as an alternate name field if your system supports it.
Log every merge. Record who merged, when, and which record ID was retired. If a merge was done in error, you need a trail to undo it.

Preventing recurrence: validation and dedupe on entry

A cleanup is temporary if you do not fix the entry points. The same sources that created duplicates before will create them again after your cleanup if the underlying logic has not changed.

Require email on all web forms. A form submission without an email cannot be matched to an existing record. Make email required and use it as the lookup key before creating any new record.
Set import rules: match on email first, then name plus address. Any import tool should be configured to check for existing records before creating new ones. An unmatched record should go to a review queue, not auto-create.
Show staff a “possible match” prompt during manual entry. When staff enter a new person, the system should surface any records with similar names or matching contact info before the new record is saved. A 30-second review is cheaper than a 30-minute merge later.
Normalize data on entry. Strip extra spaces from names. Format phone numbers consistently. Title-case or lowercase names automatically. Inconsistent formatting is the leading cause of failed exact-match deduplication.

A cleanup cadence that stays manageable

A one-time cleanup followed by nothing will return to baseline within 12 months. Build a recurring cadence so the problem never grows large enough to require another full cleanup.

After every bulk import: run the email-match and phone-match reports immediately. Resolve flagged duplicates before they enter normal workflows.
Monthly: run the fuzzy name-plus-address report and review the queue. 30 minutes once a month prevents the six-hour quarterly cleanup.
Annually: run a full database audit. Pull every record with no email, no phone, and no interaction in 12 months. These are likely stale duplicates or incomplete entries. Archive or delete after review.

Key takeaways

Duplicates form at imports, web forms, manual entry, and migrations. Fix the entry point or the cleanup is temporary.
Use email as the primary deduplication key, then phone, then name plus address. Run each as a separate pass.
Never auto-merge fuzzy matches. Flag for human review. Auto-merging eventually merges two different people.
Keep the oldest created date and all transaction history on the surviving record. Keep the most recently updated contact info.
Normalize data on entry: consistent name casing, stripped phone formatting, required email. Inconsistent formats defeat deduplication.
A monthly 30-minute review queue is better than a 6-hour annual cleanup.

Common questions

How do I handle two real people with the same name and the same address, like a parent and adult child?

Different email addresses or phone numbers are almost always sufficient to distinguish them. If they truly share all contact info, add a birth date or a middle initial to the core record to create a unique identifier. Flag these records explicitly so staff know the similarity is intentional, not an error.

What should I do with records I cannot confidently confirm are duplicates?

Leave them separate and tag both as “possible duplicate, needs review.” Do not merge without confirmation. A wrongly merged record combines two people’s history in a way that is very difficult to undo. When in doubt, wait.

Is there a safe way to bulk-merge a large number of confirmed duplicates?

Yes, if your system supports it. Export the confirmed duplicate pairs to a spreadsheet with “keep” and “retire” record IDs. Process the merges in batches of 50 to 100 rather than all at once. This limits the damage if a batch has an error, and it gives you a chance to review the merged records before the next batch runs.

How long does a full database cleanup take?

For a database of 1,000 to 5,000 records, a full cleanup takes 4 to 8 hours if you have a staff member dedicated to it and a tool that surfaces candidates for review. For 10,000 or more records, plan for two to three days spread over two weeks. The time is mostly in the human review queue, not the merge operation itself.

The takeaway. Duplicates are a process failure, not a data failure. They form because entry points have no deduplication logic. Fix the sources, run a cleanup, and build a monthly review habit. The hour you spend each month on the review queue saves the six hours you would spend on the next full cleanup, and it protects the people in your database from the invisible harm that bad data causes.