Searching for duplicate records can be a pain. Software applications use algorithms that produce results, but how do you know what results you have, and what error rates they carry? We have all seen software flag a duplicate that isn't really one, or miss a genuine duplicate entirely. Then there is the problem of speed, where an algorithm can take too long to find all the answers. So what strategy can help with these problems?
In this article, a method of finding duplicates is introduced. Please click on the banner at the bottom to download the whitepaper on The Science of Deduplication.
This method focuses on:
- Finding Duplicates
- Finding Non-Duplicates
To visualise this method, imagine we have a database and we want to compare every record with every other one in order to find duplicates. Having 7 records leads to 49 (= 7 x 7) comparisons. First, let's remove some redundancy: if record 1 is compared with record 5, we don't need to compare them the other way round, as it is the same comparison. We also do not need to compare a record with itself, for obvious reasons.
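As a quick sketch of this counting argument, the snippet below generates each unordered pair of records exactly once (no self-comparisons, no mirrored pairs), which matches the n(n-1)/2 formula; the record IDs here are just illustrative:

```python
from itertools import combinations

def comparison_count(n):
    # Unique pairwise comparisons for n records, after dropping
    # self-comparisons and mirrored (A-B vs B-A) pairs.
    return n * (n - 1) // 2

records = range(1, 8)  # 7 illustrative record IDs
pairs = list(combinations(records, 2))  # each unordered pair once

print(len(pairs))           # 21, down from 7 x 7 = 49
print(comparison_count(7))  # 21
```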
This leaves us with only 21 record comparisons to be made (see diagram below). If we can identify duplicates and non-duplicates with a strong sense of certainty, then we have a group of record comparisons that we are happy about, and we can look to minimise the uncertain area, which we can measure. Knowing the exact status of the duplicate comparisons allows us to decide whether it is worth pursuing the search for more duplicates or accepting the current error rate; in some cases that error rate can be zero. One can then continue with automatic searching or use a manual process to find the remaining duplicates.
Now let's look at this diagram, which outlines the comparisons of records to be made for a database with 7 records:
Duplicate comparisons are made within the light grey squares (the dark ones are redundant comparisons). A D marks a duplicate (e.g. records 1 and 3 are duplicates of each other) and an N marks a non-duplicate (e.g. records 4 and 6 are definitely non-duplicates). Of the 21 possible comparisons, 3 are duplicates (14%), 12 are non-duplicates (57%) and the remaining 6 are undetermined (29%). So we are sure about 71% of the comparisons made, but there could be more duplicates among the remaining 29%. We can write more deduplication routines to eliminate them, or we can check them manually.
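To make the arithmetic concrete, here is a minimal sketch that tallies the outcomes from the 7-record example above (the label counts come from the diagram; 'D', 'N' and '?' are labels chosen here for duplicate, non-duplicate and undetermined):

```python
# Outcome labels for the 21 pair comparisons in the 7-record example:
# 3 duplicates, 12 non-duplicates, 6 undetermined.
outcomes = ['D'] * 3 + ['N'] * 12 + ['?'] * 6

total = len(outcomes)                         # 21
certain = sum(o in ('D', 'N') for o in outcomes)

print(f"duplicates:    {outcomes.count('D') / total:.0%}")  # 14%
print(f"non-duplicates: {outcomes.count('N') / total:.0%}")  # 57%
print(f"certain:       {certain / total:.0%}")              # 71%
print(f"undetermined:  {(total - certain) / total:.0%}")    # 29%
```

The "certain" share is the figure of merit: any further routines or manual checks only need to target the undetermined 29%.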
Looking at the duplicate-searching process this way allows us to measure the uncertainty and know that we have found all the duplicates, or at least a high percentage of them. View the whitepaper on The Science of Deduplication by clicking the banner below.