Data analytics pipeline best practices: Data governance
Data analytics pipelines bring a plethora of benefits, but ensuring successful data initiatives also means following best practices for data governance.
While data and analytics propel organizations forward, a vital consideration is the role of data governance in the analytics pipeline and the pitfalls that weak governance can create.
Most organizations do a haphazard job of data inventory and cataloging at best. It's not uncommon to learn after an assessment by a third party that a given business-to-consumer organization has massive amounts of personally identifiable information (PII) duplicated in hundreds if not thousands of different places.
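To illustrate how that duplication gets surfaced, here is a minimal sketch, not any particular vendor's tool, that scans a folder of CSV extracts for email- and SSN-shaped values and reports which ones recur across files. The directory name, regex patterns and file layout are assumptions for illustration only.

import csv
import re
from collections import defaultdict
from pathlib import Path

# Illustrative patterns for two common kinds of PII.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def scan_for_duplicated_pii(export_dir: str) -> dict:
    """Map each PII-looking value to the set of CSV files it appears in."""
    locations = defaultdict(set)
    for path in Path(export_dir).glob("*.csv"):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                for value in row.values():
                    if not isinstance(value, str):
                        continue
                    value = value.strip()
                    if any(p.match(value) for p in PII_PATTERNS.values()):
                        locations[value].add(path.name)
    # Keep only values that show up in more than one extract: the duplication problem.
    return {v: files for v, files in locations.items() if len(files) > 1}

duplicates = scan_for_duplicated_pii("exports")  # "exports" is a hypothetical folder of data dumps
print(f"{len(duplicates)} PII-like values duplicated across files")

Even a crude scan like this tends to reveal far more copies of the same customer records than anyone expected.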
The most useful enterprise data is initially inaccessible to data innovation teams, with months of negotiation sometimes necessary to free up one data set or another from whichever data cartel controls it. Often, leadership does little blocking and tackling to help.
On a one-to-five maturity scale, with five being the highest ranking, most organizations have reached only level one or two. In practice, this means much of their data is not analytics-ready. The consequence of this lack of data maturity is that analytics teams must bootstrap data evaluation and cleaning themselves, leaving much less time for actual analysis.
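That bootstrapping often looks something like the following sketch, which assumes pandas and an illustrative orders.csv extract with hypothetical order_date and amount columns: a quick profile of missing and duplicated values, then just enough cleaning to make analysis possible.

import pandas as pd

def profile_and_clean(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)

    # Quick evaluation: how much of each column is missing, and how many rows repeat?
    print(df.isna().mean().sort_values(ascending=False).head(10))
    print(f"duplicate rows: {df.duplicated().sum()}")

    # Minimal cleaning before any real analysis can start.
    df = df.drop_duplicates()
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    return df.dropna(subset=["order_date", "amount"])

clean = profile_and_clean("orders.csv")  # hypothetical extract name

None of this is analysis; it is the tax paid before analysis can begin.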
As organizations use pipelines to gather, process and consume more data than ever before, the regulations they must comply with are growing in both number and complexity. Even knowing which regulations apply is a challenge. A few years ago, the EU's GDPR had just passed. Then the CCPA followed.
Those laws were just the beginning. Most recently, China is poised to impose an even stricter data mobility law, along with other information security requirements in the public and transportation sectors, and is implementing an identity blockchain.
Much of the coming regulation will demand various kinds of proof of compliance. That translates into more compliance headcount, documentation and reporting, particularly for heavily regulated industries with extensive global supply chains, whether physical or digital.
Out-of-control SaaS bloat, application sprawl and data siloing
SaaS management software company Zylo now estimates the average enterprise has 600 SaaS applications in use, with 10 new applications added each month, each with its own database and data model. Data generated by these apps is scattered all over the place, and each SaaS provider has its own quirky way of providing access to the data it generates.
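In pipeline terms, each of those quirks becomes another per-source adapter. The sketch below is illustrative only, with invented payload shapes: two hypothetical SaaS exports describe the same customer record differently and must be normalized into one schema.

from dataclasses import dataclass

@dataclass
class Customer:
    customer_id: str
    email: str
    source: str

def from_crm(payload: dict) -> Customer:
    # Hypothetical CRM export: nests contact details under "properties".
    return Customer(payload["id"], payload["properties"]["email"], "crm")

def from_billing(payload: dict) -> Customer:
    # Hypothetical billing export: flat fields with different names.
    return Customer(str(payload["account_number"]), payload["contact_email"], "billing")

records = [
    from_crm({"id": "C-1", "properties": {"email": "a@example.com"}}),
    from_billing({"account_number": 42, "contact_email": "a@example.com"}),
]

Multiply that adapter work by hundreds of applications and the governance burden becomes clear.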
At the same time, most SaaS subscriptions are underutilized. Organizations subscribe to many different application bundles with considerable overlap among them, which can leave the workforce confused about which functionality to use, in which suite and why. As a result, each app may lack a critical mass of data to tap for analytics purposes.
With a dozen or more applications in use, a manufacturer, for example, might have analysts switching back and forth between apps repeatedly just to assemble a cohesive enough view of a troublesome process to troubleshoot it.
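The stitching those analysts end up doing by hand can be as simple as the following sketch, which assumes two hypothetical plant-system extracts, mes_runs.csv and quality_checks.csv, joined on a batch ID to see one process end to end.

import pandas as pd

# Hypothetical extracts: one from an MES, one from a quality app.
mes = pd.read_csv("mes_runs.csv")            # e.g., batch_id, line, start_time
quality = pd.read_csv("quality_checks.csv")  # e.g., batch_id, defect_rate

# Join the two views, then surface the batches with the worst outcomes.
combined = mes.merge(quality, on="batch_id", how="left")
worst = combined.sort_values("defect_rate", ascending=False).head(10)
print(worst[["batch_id", "line", "defect_rate"]])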
Considering pipeline automation alternatives
It's not surprising that all-in-one pipeline automation has become a holy grail for some platform providers. Many enterprises share the same cloud providers, the same department-level SaaS applications and the same types of de facto standard databases.
The clear logic behind an all-in-one platform like Gathr, for example, is that companies will often need the same connectors or "operators," much of the same drag-and-drop machine learning process assembly, and the same sorts of choices among ETL, ELT and ingestion capabilities. Unifying all this functionality could mean less work for data and analytics teams.
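The ETL-versus-ELT choice those platforms bundle can be reduced to a toy sketch like the one below. This is not Gathr's API; the connector, transform and loader roles are generic placeholders that only show where the transformation step sits relative to the load.

from typing import Callable, Iterable

Row = dict
Connector = Callable[[], Iterable[Row]]
Transform = Callable[[Row], Row]
Loader = Callable[[Iterable[Row]], None]

def run_etl(extract: Connector, transform: Transform, load: Loader) -> None:
    # ETL: shape rows in flight, before they land in the warehouse.
    load(transform(row) for row in extract())

def run_elt(extract: Connector, load: Loader, transform_in_warehouse: Callable[[], None]) -> None:
    # ELT: land raw rows first, then push transformation down to the warehouse engine.
    load(extract())
    transform_in_warehouse()

The appeal of an all-in-one platform is that the same connectors and operators plug into either ordering without rework.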
But enterprises should remember that the compulsion to subscribe to yet another SaaS extends to these platforms. Engineers in one business unit might gravitate to a Gathr, while others might favor an Alteryx to map together sources a BI platform might need, or a super SaaS like OneSaaS that allows simplified mixing and matching within the OneSaaS environment.
Long-term best practices and data governance for analytics pipelines
Data strategists should realize that such platforms provide only a starting point: a short-term solution for streamlining data flows from common sources, given immediate circumstances and needs. Without a transformed, data-centric architecture, companies could unwittingly add to the technical and data debt they already face. A year or two down the road, the next new pipelining platform to enter the market could be just as tempting.
The root cause of companies' data struggles and lack of governance is complexity, and with it a lack of data visibility, that doesn't have to exist. Instead of contributing to that complexity, enterprises should build rather than buy, with a more bespoke effort that supports data centricity and gives the data workforce diverse ways to examine and attack challenges.
Industry supply chain consortia could be a good way to experiment along these lines. An area like high-rise construction, for example, underscores what's possible when companies start with less centralized, federated, solid data storage and sharing pods and a single knowledge-graph-enabled data model for all providers in the chain.
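A single shared, knowledge-graph-enabled data model can be pictured with a few triples, as in this minimal sketch. The entities, predicates and the who_supplies query are invented for illustration and are not drawn from any real consortium.

# Providers publish facts as triples against one shared vocabulary
# instead of each keeping a private, duplicated copy of the same rows.
triples = [
    ("site:tower-a", "hasComponent", "component:hvac-unit-7"),
    ("component:hvac-unit-7", "suppliedBy", "org:vendor-12"),
    ("org:vendor-12", "certifiedFor", "standard:ASHRAE-90.1"),
]

def who_supplies(site: str) -> set:
    """Walk site -> component -> supplier over the shared graph."""
    components = {o for s, p, o in triples if s == site and p == "hasComponent"}
    return {o for s, p, o in triples if p == "suppliedBy" and s in components}

print(who_supplies("site:tower-a"))  # {'org:vendor-12'}

Because every provider reads and writes against the same model, there is far less reason to copy data into yet another silo just to answer a cross-company question.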
Success in data governance in the analytics pipeline ultimately comes from dramatically shrinking the problem footprint in ways like this one. The companies in the consortium could suddenly find they're duplicating far less data than they used to, because the system they're using is designed to sidestep both the tendency toward duplication and the need for it.