In many organisations you will find islands of data that should be stored centrally, particularly if a central database or CRM application exists. The nature of how many of us work means that we have spreadsheets, our own custom database or Outlook contacts which ideally should all be stored centrally.
For every island of data, there are an equal number of reasons (or excuses) for their existence, ranging from not wanting others to see the data to just forgetting to import the data in the central database.
In this article is a simple methodology for merging multiple sources of data and creating a single central database.
Step 1: Normalisation
1. Collect all the data from the different sources
2. Convert them into the same format, say Microsoft Excel – Each Excel
file will different fields and naming conventions
3. Create a database definition for the new central database - Decide
which fields you want to store
4. For each Excel file mark the columns into 3 categories:
a. Columns that are be used to match to other Excel files (these
columns are common to all the data sources)
b. Columns that have a column in the central database but will not be
used to match with
c. Columns that will be discarded
5. Add a primary key for each Excel file – Add a column with a record
6. Merge all the Excel files into one file
a. Ensure all the matching fields are in the same column (Columns from
b. Ensure columns from 4.b are in the file
c. Add a source column as you add each Excel file to the central one
d. Add a primary key for this new central file
At this point we have a central data base with all the fields we want and we know which records are from which data island.
Step 2: Standardisation & Improvement
Every field is formatted to the same standard. Each field should be checked for valid characters, whether the data matches to a format structure or matches to a value in a lookup list.
Additionally, certain information can be verified against national reference files. For example, UK names can be checked against the Electoral Register, global business can be validated against Dun & Bradstreets’ global business file.
If you have internal business rules these can be applied to the data.
After this stage the data can be considered to be clean.
Step 3: Matching
We are now ready to match all the records with each other. Refer to our blog on deduplication to learn how this process works. The result from the process will be a list of records that are clustered together that have been matched. The records are considered to be the same information.
To decide which record should be the master (golden) record, a set of business rules combined with the data quality of the records can help identify it. Our article on golden records further describes this process.
As we have the source field, we know where the master record has come from. This can be really useful, knowing which data source has the most important data.
Step 4: Merging
For each of the clusters identified in step 3, the data can be merged into one master record. Apart from the master record, the other records in a cluster can now be deleted.
The above process allow you to remove the islands of data, it doesn’t matter on the number of records or the number of data sources you have. Once the data is centralised, a combination of technology, processes and user training will ensure data is kept centrally.
If you would like further information on removing data islands within your business then contact Acuate:
|Telephone:||+44 (0) 20 8799 0360|