2017 will be remembered as the End of our Data Innocence.
We saw next-level data breaches (Equifax, Yahoo!, SEC, Uber, etc.), the #FakeNews epidemic, the political weaponization of Social Media, and recurring warnings (both hyperbolic and very real) about the hazards of unchecked AI.
Data-related events are escalating in public visibility and impact, and pose one of the greatest threats to the advancement of the tech industry that we’ve ever seen. And while it’s easy to sit back and blame criminals, rogue nations or other bad actors, the time for being passive is behind us.
Data Responsibility must become a priority for our industry.
Every person, organization, system, sensor, and intelligent machine that interacts with data has a responsibility for that data. It’s unacceptable that we’ve created the technology to trace a single organic banana every step of the way from farm to table, but we can’t tell you who touched a specific piece of data or where it’s been copied. In 1998 perhaps we could have argued this was too difficult to solve, but it’s now 2018 and our industry simply cannot continue to make excuses. Just this week, Ginni Rometty, Chairman, President, and CEO of IBM, made strong statements in support of data responsibility, a move we hope to see from other CEOs and industry leaders in 2018.
Recognizing the Data Responsibility Problem
We will not solve this problem by pointing fingers at security, or infrastructure, or data teams. Architects, DBAs, developers, analysts, business teams, and every person or system involved in the end-to-end data pipeline share in Data Responsibility. Just as the people and companies that fertilize, pick, transport, or sell the bananas take responsibility for their harvest, so must we.
Data becomes “too valuable to use”
We created PencilDATA last year to solve the problem of realizing data value. Today, the more valuable a specific dataset becomes, the more protected and less accessible it becomes to the teams that could put it to the best use. These valuable datasets get locked down by regulations, contracts and IP protections so that access becomes severely restricted. Projects like training a hungry machine learning model are delayed or derailed because the data science team just can’t get access to the right training data — it’s effectively been deemed “too valuable to use”.
How does that model make any business sense? What we came to realize in talking to early customers and partners is that it’s easier (and safer) to not use the data at all than it is to risk a data breach or bad news headline. That isn’t a solution, and is exactly the problem we’re solving at PencilDATA.
The lack of an end-to-end data responsibility toolchain means that data owners cannot capture the true value of their data, and that criminals and other bad actors have plenty of opportunity for unchecked access, something we saw taken to a new level in 2017.
Unlike the banana industry, we haven’t implemented a trusted model for tracing the end-to-end lifecycle of data. Instead, we rely on dozens of individual and disconnected systems, people, and processes that by all measures have room for improvement. The problem, put simply, is that we don’t know where our bananas are.
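To make “knowing where our bananas are” concrete, here is a minimal sketch of what a single provenance record for one “touch” of a dataset might look like. The schema, field names, and actor address are hypothetical illustrations, not a proposed standard:

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(dataset_bytes: bytes, actor: str, action: str) -> dict:
    """Record who touched a dataset, what they did, and a content
    fingerprint of the data at that moment (hypothetical schema)."""
    return {
        # SHA-256 of the bytes identifies exactly which version was touched
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "actor": actor,        # person, service, or sensor
        "action": action,      # e.g. "copied", "transformed", "sold"
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Example: an analyst copies a dataset; the record answers
# "who touched this data and where has it been copied?"
record = provenance_record(b"customer,churn\nalice,0\n",
                           "analyst@example.com", "copied")
print(json.dumps(record, indent=2))
```

A chain of such records, one per touch, is the data analogue of the farm-to-table trace on a banana.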
PencilDATA is solving the ‘last mile’ problem of getting valuable data into the hands of the teams that need it most, but what about the rest of that data’s lifecycle? What happens before the data becomes “valuable” (or after it stops being so)? Those are the times when data is most vulnerable, because organizations focus their energy on protecting their most valuable data, which might represent only 5% of the total data they own. Who is responsible for the other 95%?
Solving The Data Responsibility Problem – Our Proposal
We propose an inclusive, community-based open source initiative to define, build, and maintain a pragmatic framework for responsibly handling datasets throughout their lifecycle. We need at least the same level of end-to-end visibility for our datasets that’s afforded to a bunch of organic bananas.
The good news is that we’re not starting from scratch. Many open source projects exist today which, if orchestrated around data responsibility, can come together and begin to solve this problem. Some examples include:
- Apache Accumulo, Atlas, NiFi & Ranger
- CNCF Linkerd/Conduit, Notary & TUF
- Google Grafeas / Kritis, Istio & their DevOps Supply Chain partners
- Linux Foundation CDLA & Blockchain* projects via Hyperledger
In addition, there’s a long list of vendors, including (but certainly not limited to!) AWS, Cloudera, Dell, Docker, Google, IBM, Hortonworks, HP, Informatica, MapR, Microsoft, Oracle, and Salesforce, that already provide many of the individual components and technologies in use across the industry and can help support the goal of end-to-end Data Responsibility. We welcome their valuable contributions and collaboration in solving this industry-wide problem.
[*Exciting new technologies like Blockchain alone won’t magically solve the problem of end-to-end data responsibility, but Blockchain does bring the accessible, transparent, and immutable level of proof that’s required if we’re going to gain public trust in the results of our efforts.]
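The immutable-proof property mentioned in the footnote can be illustrated with a toy hash chain, the core mechanism behind blockchain ledgers: each entry’s hash covers the previous entry’s hash, so rewriting history anywhere breaks every later link. This is a sketch with hypothetical names, not a production design:

```python
import hashlib
import json

def append_entry(chain: list, entry: dict) -> None:
    """Append an entry whose hash covers both its content and the
    previous entry's hash, so any later edit breaks the chain."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps(entry, sort_keys=True) + prev_hash
    chain.append({"entry": entry, "prev": prev_hash,
                  "hash": hashlib.sha256(payload.encode()).hexdigest()})

def verify(chain: list) -> bool:
    """Recompute every link; returns False if any entry was altered."""
    prev_hash = "0" * 64
    for block in chain:
        payload = json.dumps(block["entry"], sort_keys=True) + prev_hash
        if block["prev"] != prev_hash or \
           block["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev_hash = block["hash"]
    return True

chain = []
append_entry(chain, {"actor": "etl-job-7", "action": "transformed"})
append_entry(chain, {"actor": "analyst@example.com", "action": "copied"})
assert verify(chain)                      # untampered chain verifies

chain[0]["entry"]["actor"] = "mallory"    # try to rewrite history...
assert not verify(chain)                  # ...and verification fails
```

Anyone holding the chain can detect tampering without trusting the party who produced it, which is the kind of transparent proof an end-to-end data responsibility framework would need.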
Join us in taking the first steps together this year on an open source specification covering metadata, formats, and implementation for Data Responsibility, starting with an oversight committee and governing board made up of the initiative’s sponsors.