Minimizing the Effects of Bad Data in Big Data With Data Governance
We live in a time when corporate data is growing at an unprecedented rate, and given current technology trends, this growth can only accelerate. Many industry leaders have already concluded that when it comes to data, bigger is certainly better. Vendors have fueled this thinking by aggressively promoting the idea that with more data, one can derive better insights and, in turn, drive better business outcomes. Consequently, many companies have begun to amass data from every possible source. Even though they may be unsure about its veracity, they fear that by not doing so, they may miss out on a potential competitive advantage in the future.
Digital platforms make it straightforward for businesses to gather insights from interactions with customers in both the offline and online worlds. These data sets in many cases include transactional data augmented with additional dimensions such as geographic, demographic and psychographic data, which are sourced from second- and third-party data sources. This also serves a strategic business objective, as the Holy Grail for most businesses is to have a rich, up-to-date profile of their customers in order to foster highly personalized relationships. Typically managed within massive enterprise repositories like Data Lakes or Data Warehouses, this Big Data drives many key downstream processes for customer engagement, including digital marketing, CRM and loyalty programs. But the key question is, does all this Big Data always lead to better business outcomes? The devil is in the details, as they say.
In today’s business world, data management has mushroomed in scope and complexity and must be looked at holistically. Unlike in the past, when data originated from known sources, a significant percentage of the Big Data entering enterprise repositories is obtained from myriad sources of indeterminate pedigree. Given the massive volume and constant flow, few companies have the capability to test data quality by sampling more than 10 percent of the data entering their systems, according to industry surveys.
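To make the idea of sample-based quality testing concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular company's process: the field names in REQUIRED_FIELDS, the 10 percent default and the helper names are all hypothetical. It draws a random sample of incoming records and reports the share that have every required field filled in.

```python
import random

# Hypothetical schema: the fields an intake record is assumed to need.
REQUIRED_FIELDS = ("customer_id", "email", "postal_code")


def is_filled(value):
    """True when a value is present and not just whitespace."""
    return value is not None and str(value).strip() != ""


def sample_records(records, fraction=0.10, seed=0):
    """Return a random sample holding roughly `fraction` of the records."""
    if not records:
        return []
    rng = random.Random(seed)  # seeded so a check can be repeated
    size = max(1, round(len(records) * fraction))
    return rng.sample(records, size)


def completeness(records, required=REQUIRED_FIELDS):
    """Share of records in which every required field is filled in."""
    if not records:
        return 0.0
    complete = sum(
        1 for record in records
        if all(is_filled(record.get(field)) for field in required)
    )
    return complete / len(records)
```

A governance policy could then set a threshold, for example holding back any batch whose sampled completeness falls below an agreed level. The right threshold and sample size would depend on the data source and the business process it feeds.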
Ensuring data accuracy, consistency and completeness at the intake stage is a critical differentiator for any successful Big Data program. Bad data can also arise from unsophisticated segmentation techniques, where customers end up with incorrect attributes attached to their profiles. The issue with bad data is that it has a ripple effect on overall decision making and can easily destroy hard-won customer relationships. It does not matter how many smart Data Scientists a company may have, because the quality of their insights is only as sound as the data they are relying upon. I remember seeing a cartoon of a heavily tattooed biker lamenting that he had made the mistake of buying his little niece a pink pony stuffed toy online. Since that purchase, his email inbox had been flooded with offers for girls' clothing, cosmetics and even baking supplies.
The first line of defense for ensuring quality data is still a well-defined Data Governance program. Simply put, this means having a framework with a clearly defined set of policies and best practices through which the company collects, stores, secures and uses data. Through governance, it also becomes easier to trace data lineage (i.e. define controls on where the data comes from), which is imperative for a Big Data initiative. Governance can also help standardize the technologies best suited for Big Data scenarios, as recent advances in stream-based data quality tools can significantly accelerate data delivery cycles.
One emerging area for Data Governance, especially in the context of Big Data, is particularly interesting: governance of machine learning algorithms and their training data sets. The trends for the foreseeable future indicate that machines, rather than humans, will increasingly be making decisions for all data-intensive business processes. In digital marketing, this could apply to the online advertising and promotions consumers receive. The paradigm shift relies on machines learning how to make good decisions based on data patterns, rather than following steps explicitly encoded by human programmers, as has been the case thus far. Here, governance will be imperative to remove bias from the decision process by ensuring that relevant data sets are used to train the machine learning algorithms, and that data accuracy is high. Consider one recent, highly publicized scenario in which female job seekers on the Google network were much less likely than their male counterparts to be shown online ads for highly paid jobs. We have to be careful about such unwitting biases, which presumably originate in bad training data sets, as they lead to bad decisions and alienate future prospects and customers. In fact, governance may also mandate that training sets be archived so that a future audit of the decision algorithm can verify there was no bias.
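As a rough illustration of what archiving a training set for audit might involve, the Python sketch below records a fingerprint of the data alongside some basic metadata. Everything here is an assumption made for the example: the function names, the metadata fields and the choice of a SHA-256 digest over a JSON serialization are hypothetical, and a real program would also need to store the data itself under a retention policy.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(rows):
    """SHA-256 digest of the rows, serialized with sorted keys so that
    the same content always produces the same digest."""
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def audit_record(model_name, rows, source):
    """Build a metadata entry describing one archived training set.
    The field names are illustrative, not a standard."""
    return {
        "model": model_name,
        "source": source,
        "row_count": len(rows),
        "sha256": fingerprint(rows),
        "archived_at": datetime.now(timezone.utc).isoformat(),
    }
```

If an auditor later recomputes the fingerprint of the archived data and it matches the stored value, that is evidence the training set has not been altered since the model was built. It says nothing about whether the data was biased in the first place, which still requires a review of the data itself.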
While recent technology advances offer companies the opportunity to understand their customers like never before, the insights are only as good as the quality of the data. In the era of Big Data, it’s actually about better data, not bigger data. Good data governance practices offer a practical way to significantly reduce the risks and challenges, and they need to be the cornerstone of every enterprise Big Data initiative.