banner

For a full list of BASHing data blog posts, see the index page.     RSS


Partial duplicates

In the business world, duplicate records are only rarely true duplicates. More often they're multiple records for the same customer (for example) with spelling differences in the customer's name, or differences in the formatting of the customer's address.

Finding these "pseudoduplicates" can be difficult. OpenRefine might help, and there are many commercial services which offer deduplication at a price.

Another way to view pseudoduplicates is to see them as partial duplicates. Instead of looking for parts of a record that might be the same as parts of another, you could screen for combined parts of a record that are the same, but perhaps shouldn't be.

Here's an example. The 19 records in the tab-separated table below (called "file") are each unique, even if we ignore the record ID field (see screenshot).

IDField1Field2Field3Field4Field5
10001be-29b2-4aae-9e61-d4d019adef79
100027b-ea6a-426e-bae2-45ed2c358c6a
1000339-cfd4-4318-875c-e7cbe71417f2
1000442-ab7b-4947-96fd-1036c98a1235
10005db-3f71-49a9-b24c-5a1fca079c31
100069f-aff0-4df5-8ea3-b5a092755912
10007d3-1c64-4a57-baa0-68804191270a
10008be-17c8-4d7e-97ee-ec55168faf14
10009b6-4741-4e3d-b681-4fcbe71417f2
10010b4-079e-47e8-be1e-0ea9999a8cd7
100119c-f8bf-4da7-a400-2519042a98df
10012c8-8f21-4239-bfea-2b82d21a1cbd
10013ae-aad7-4204-a351-c1851871300b
10014cd-b571-49a9-bc6d-eb97b49401b9
1001534-bad4-4318-8df9-abcbe71417f2
10016af-c2d1-4dfd-a589-87edfbe7d380
10017a8-3671-49a9-ba15-c21fca079c31
10018ff-7871-49a9-b701-9e6a8470122b
1001989-bbd4-4318-8f9d-60cbe71417f2

unique

I know, however, that there are partial duplicates lurking in this file, in the combined fields 3 and 5. The equivalent in a customer database might be "same address and phone number, but differently spelled name". To find these partial duplicates, I use one of two AWK commands. One reads the file twice and doesn't use much memory, and is suited to files of any size. The other command reads the file once; with big files it uses a lot of memory.

The two-pass command puts the field 3/field 5 combination in an array during the first pass through the file, and counts the number of occurrences of each combination. In the second pass, AWK matches the field 3/field 5 combination in the current line to the array, and prints that line if the array count is greater than 1:

awk -F"\t" 'FNR==NR {a[$3,$5]++; next} a[$3,$5]>1' file file

2-pass1

A follow-up sorting organises the output more neatly:

awk -F"\t" 'FNR==NR {a[$3,$5]++; next} a[$3,$5]>1' file file \
| sort -t $'\t' -k3 -k1

2-pass2

The one-pass AWK command puts every line into the array 'a', indexed by the field 3/field 5 combination. If the current line's fields 3 and 5 are already in 'a', the corresponding stored line is set equal to a variable 'b', and 'b' is printed followed by the line currently being checked. The next time there's a successful check of the array, 'b' is reset.

awk -F"\t" 'b=a[$3,$5] {print b"\n"$0} {a[$3,$5]=$0}' file

1-pass1

The output of the one-pass command needs uniquifying as well as sorting:

awk -F"\t" 'b=a[$3,$5] {print b"\n"$0} {a[$3,$5]=$0}' file \
| sort -t $'\t' -k3 -k1 | uniq

1-pass2

In a real-world case recently, I had to modify the two-pass command because many of the records had blanks in the two fields being checked, which made them "the same". I checked for blanks after noticing that the count of partial duplicates was suspiciously large:

2-pass3

The modification was to add another condition to the second part of the AWK command, namely that field 2 isn't blank:

2-pass4

The two pseudoduplicated records have the same collection codes ("NM" in field 17) and catalog numbers ("E239" in field 2). Apart from different strings for the unique record ID (field 1) and the language of the institution name (field 16, "Národní muzeum" vs "National Museum of the Czech Republic"), the records are identical.

The 2-pass and 1-pass AWK commands make finding partial duplicates easy. The hard part of the job is deciding which fields to check!



To avoid having to write the two-pass command from scratch every time I use it, I've saved its skeleton in the function "coldupes", which uses the xclip utility:

coldupes() { echo -en "awk -F\"\\\t\" \x27FNR==NR {a[]++; next} \
a[]>1\x27 $1 $1 | sort -t \$\x27\\\t\x27 -k" | xclip; }

I enter "coldupes" and the filename in a terminal and press Enter. I then middle-click-paste at the next prompt to get the skeleton command, ready for adding the fields to check and the sort instructions:

coldupes

Last update: 2018-07-14