I’m not an expert on Big Data, but I have been involved in a couple of projects using some of the newer technologies. I’ve also been keeping one eye on the media and looking for insights wherever I can.
So, I thought I would start to write down some of my thoughts on Big Data and Data Quality; but first a brief introduction.
(A cautionary note: there seem to be few solid conclusions about Big Data results, hence some of the comments are a little vague.)
Big Data, as its name suggests, is the management of very high volumes of data (petabytes, 10^15 bytes, and exabytes, 10^18 bytes, in size) and learning how to extract value from it to increase revenues and profits.
The Big Data industry seems to have adopted Gartner’s 3Vs definition to describe the key characteristics of Big Data:
1. Volume – Amount of Data
2. Velocity – Speed of Data
3. Variety – Differing Types of Data
The first thing to note is that not all businesses have very high volumes of data; quite the opposite, most businesses are small to medium. However, the range (or Variety) of data available to everyone is far greater, particularly if you use social media. There are many new companies and tools to extract information from social media, but the larger business case of generating greater revenues for all still needs to be proven.
It’s not just the private sector that’s involved with Big Data: government, science, and research establishments all make good use of some of the new concepts and technologies.
Focusing on commerce, the most important use of Big Data is in real-time predictions and decision making. Can we find greater insights into buyer habits with all this extra data, then sell more goods/services and/or use those insights to retain customers for longer? What is most exciting is the possibility of unearthing new marketing techniques for greater growth.
Just because we have a new concept does NOT mean the traditional approach of analysing historical data is no longer useful. There is real value in historical data, and that value can be used to make precise and valid predictions. Real-time data can add more value on top, but the insights are very likely to be company specific. Hence, drawing big marketing conclusions from the use of Big Data concepts and technology may be a stretch at the present moment.
The tried and tested marketing approaches of:
1. Providing the right product/services to the right person at the right time; and
2. Speed and Convenience
will continue for a very long time.
So how does data quality fit into the Big Data model? The short answer is that we don’t fully know. While businesses are innovating with Big Data solutions, a few case studies on data quality are beginning to appear.
For the traditional database, we know that poor data will affect results; this has been proven many times, and ROI calculations are easily made.
How do you solve data quality problems in a real-time environment? I’ve heard some commentators suggest that data cleansing is not required and that poor data can simply be ignored – this just doesn’t seem right.
Duplicates have always caused problems, and if ignored they will give you incorrect results. Even in a real-time environment you are likely to have duplicate information, so if Big Data is about making predictions using vast data stores, those duplicates will lead to incorrect decisions. But how can you cleanse data in a real-time environment to the same degree you can with a traditional database? At this moment you can’t; I suspect new techniques will be needed, both for cleansing and for deciding whether data is of good or poor quality. These will most likely rely on automated or robotic techniques.
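To make the duplicate problem concrete, here is a minimal sketch of filtering a real-time stream of records on the fly. All the field names and data are hypothetical, and a real system would need fuzzy matching rather than the crude normalised key used here:

```python
# Minimal sketch: de-duplicating a stream of customer records as they
# arrive, using a normalised (name, email) key. Hypothetical example data;
# exact-key matching is far weaker than real data-cleansing tools.

def normalise(record):
    """Build a crude match key by trimming and lower-casing fields."""
    return (record["name"].strip().lower(), record["email"].strip().lower())

def dedupe_stream(records):
    """Yield only the first occurrence of each normalised key."""
    seen = set()
    for record in records:
        key = normalise(record)
        if key not in seen:
            seen.add(key)
            yield record

stream = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "ann lee ", "email": "ANN@example.com"},  # same person, messier data
    {"name": "Bob Ray", "email": "bob@example.com"},
]
unique = list(dedupe_stream(stream))
print(len(unique))  # → 2
```

Even this toy version shows the trade-off: the `seen` set grows without bound on a genuinely high-volume stream, which is exactly why real-time cleansing is harder than batch cleansing against a traditional database.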
Thinking from a business point of view, Big Data solutions will be specific to their application; some of the underlying technologies may be the same for everyone, but solutions and data quality are likely to be very domain dependent.
In the next post I’ll look at some of the technologies for Big Data and how they differ from traditional databases.