Part 7: This is the seventh of eight posts on how to measure data quality. This post describes why duplication is a good measure and how it’s used.
Duplication occurs when the same business entity appears in the data more than once. For example, a contact may appear twice, or a company may appear several times. There are many reasons why duplicates occur, but nearly every database I have analysed contains them.
Why are duplicates a problem? Let’s look at a couple of examples. In a B2B organisation, if a company appears as two different accounts, then each account may have its own order lines, its own contacts, its own activities, and so on. From an account management point of view, it is better to see both records as one company, so that the account manager has a full picture of what’s going on and can develop the account accordingly.
For a B2C organisation, duplicate people result in an individual receiving the same mailing more than once, which is not only a cost issue but a reputation issue. How many times have we all received the same mailing twice?
Duplicates are very costly to a business, and they erode confidence in the application where they are found.
Removing duplicates is both an art and a science, and it isn’t as easy as it may appear. There isn’t enough time here to run through all the methods one can use, but the topic warrants a much deeper level of explanation.
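As a minimal illustration of why matching duplicates is harder than it looks (a hypothetical sketch, not any particular tool's method), the snippet below normalises company names before comparing them; even this simple step catches variants that an exact comparison would miss:

```python
import re

def normalise(name: str) -> str:
    """Lowercase, strip punctuation and common legal suffixes so that
    near-identical company names compare as equal."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", "", name)                         # drop punctuation
    name = re.sub(r"\b(ltd|limited|inc|plc|llc)\b", "", name)   # drop legal forms
    return " ".join(name.split())                               # collapse whitespace

# "Acme Ltd." and "ACME Limited" both normalise to "acme"
print(normalise("Acme Ltd.") == normalise("ACME Limited"))  # True
```

Real matching goes much further than this (phonetic comparison, address standardisation, probabilistic scoring), which is exactly why the subject deserves its own discussion.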
Duplicates are measured as a percentage of the overall number of records. There can be duplicate individuals, companies, addresses, product lines, invoices and so on, and the ROI case for removing them varies with the type of information. Clearly, two copies of an invoice can cause problems, and two addresses can cause an expensive shipment to be misdelivered or, even worse, shipped twice.
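The measure itself can be sketched as follows, assuming a hypothetical list of match keys (whatever attributes define the business entity, once duplicates have been identified): count the surplus copies and express them as a percentage of all records.

```python
from collections import Counter

def duplication_rate(keys) -> float:
    """Percentage of records that are surplus copies of another record."""
    if not keys:
        return 0.0
    counts = Counter(keys)
    surplus = sum(n - 1 for n in counts.values())  # records beyond the first in each group
    return 100.0 * surplus / len(keys)

# Hypothetical example: 5 records, "acme" appears 3 times -> 2 surplus records
print(duplication_rate(["acme", "beta", "acme", "gamma", "acme"]))  # 40.0
```

On this definition a file with no duplicates scores 0%, which keeps the metric comparable with the other dimensions in this series.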
The other dimensions:
- Completeness (Part 2 of 8)
- Accuracy (Part 3 of 8)
- Consistency (Part 4 of 8)
- Conformity (Part 5 of 8)
- Currency (Part 6 of 8)
- Integrity (Part 8 of 8)