Challenges in data sharing and transfer

I recently spoke at the apply(meetup), an event bringing together ML and data practitioners. Thanks to Kevin, Demetrios, and Tecton for making this happen! Below is the transcript of my 10-minute talk.

The various parts of an ML system. From Hidden Technical Debt in Machine Learning Systems.

Introduction

Not so long ago, I met with over 20 MLOps and another 10 DataOps companies to understand their challenges in achieving customer success during the platform evaluation phase. In particular, I was interested to learn of their workflows at the very first step in the evaluation process — that of data collection and transfer. I had a hunch this part of the pipeline posed challenges, based on my own experiences in evaluating these platforms.

I’ll go over how the various platform companies address the existing challenges, some of the needs that remain unaddressed, and finally propose a new solution useful to both platform companies and their customers.

Avoid data transfer

To preempt the challenges that go along with customer data transfer, platform companies may choose to avoid transfer altogether.

Public or synthetic data

The easiest solution is to use public or synthetic data sets to obtain benchmarks for comparing and evaluating platforms. However, achieving good performance on public data sets is not as convincing as achieving good performance on the customer’s own data.

Bring-your-own-cloud

Another common approach in enabling customers to evaluate a piece of software is to allow them the option to run the code on their own cloud. This works particularly well for open source solutions, and also for proprietary solutions that can be packaged and installed.

The challenge with this approach is that platform companies need to ensure that customers are savvy enough to use the platform as intended, so that the assessment of its capabilities and features is fair and correct.

This is difficult to achieve if the platform requires upfront training investments and is one of the reasons why companies skip the self-hosted approach.

Further, by taking on the task of hosting, platform companies can more effectively manage versioning, bugs, and customer service.

Another drawback of a self-hosted solution has to do with compute resources. Platform companies usually tune services in order to manage costs and performance. Sometimes they move data to where GPU costs are affordable. This is not possible in the self-hosted scenario.

The good news is that many platform companies are now offering cloud-native solutions spanning a variety of cloud providers. In this scenario, post-evaluation and post-contract, no transfer of data would be necessary.

But data transfer is inevitable

However, pre-contract and during the evaluation phase, customers are asked to transfer data to platform companies so that the full potential of the tools is put on display in a managed environment. This also ensures that the results closely mirror the outputs of a real-world setting. And so platform companies and their customers value and prefer data transfer earlier in the relationship.

Data transfer is not that simple

Let’s go over some of the limitations of data transfer tooling today.

SFTP is not quite right for the cloud

One of the most popular workflows for data transfer is for the customer to set up an SFTP server, download data from their cloud data warehouse into a CSV, and then upload that data to the SFTP server. The receiver would then download the data from the server and upload it to their cloud warehouse. This last step usually requires some non-trivial tool.

Another approach would be to replace the SFTP server with a cloud storage bucket. In either case, data is extracted as a CSV file where important type information is lost, data is temporarily stored on a different or local machine, and is then uploaded to the cloud at the destination.
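To make the loss of type information concrete, here is a minimal sketch using only Python's standard library. The row and its column names are hypothetical and stand in for whatever a warehouse query might return; the round trip through an in-memory buffer stands in for the file hand-off.

```python
import csv
import io
from datetime import date
from decimal import Decimal

# Hypothetical row, as it might come out of a warehouse query.
rows = [
    {
        "order_id": 1001,
        "amount": Decimal("19.90"),
        "shipped_on": date(2021, 5, 3),
        "coupon": None,
    },
]


def csv_round_trip(records):
    """Write records to CSV text and read them back, as a file hand-off would."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


received = csv_round_trip(rows)

# Compare the type of each value before and after the round trip.
original_types = {k: type(v).__name__ for k, v in rows[0].items()}
received_types = {k: type(v).__name__ for k, v in received[0].items()}
print(original_types)
print(received_types)
```

Every value comes back as a string, and the missing coupon becomes an empty string, so the receiver has to guess which columns were integers, decimals, dates, or nulls.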

The status quo in cloud-to-cloud data transfer involves humans at multiple steps in the process.

This process is manual and complex, particularly if data is to be kept in sync. With humans in the loop, it is highly error-prone too. Monitoring is non-trivial as the pipeline leaves the confines of a single company.

Note that there are a variety of custom tools for cloud-to-cloud transfer. However, the market is fragmented, and each solution tends to be specific to a particular cloud provider. Managing egress costs is a headache in and of itself. This custom and ad hoc work is not something either side of the transaction is really motivated to take on.

Complex pipelines leave the scope of the data team

As the data transfer pipelines become complex, they leave the scope of the data team and the task falls on the engineering or DevOps teams to facilitate the transfer. This is not ideal, as platform companies want to interact directly with the team that is ultimately the user of the platform. As they iterate on the data to be transferred, it would be ideal to work directly with the team that has the domain knowledge. A good tool empowers data teams to independently oversee these transactions.

Managing compliance is difficult

Platform customers are required to stay compliant with HIPAA, GDPR, CCPA, and other legislation depending on the type of data they manage.

Without the right tools, maintaining security and staying compliant is often close to impossible. For example, local machines are not the best place to store data, even temporarily, before uploading to an SFTP server.

Testing is difficult

Sometimes customers inadvertently send data containing PII (personally identifiable information) without the right contracts in place. In this scenario, platform teams become responsible for scrubbing the data.

The status quo in data sharing and transfer means that it is difficult for either side to implement tests to validate the data in flight, for example, “Don’t send PII,” “Adhere to the following schema,” or “Ensure data completeness.” Further, there is no reliable way to communicate requirements. Platform companies might require that, for a forecasting task, the data contain entries for all of the days of the week. For fraud detection, they might require that all the main categories of fraudulent activity are included.
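As an illustration of what tests like “Don’t send PII” and “Adhere to the following schema” could look like on the sending side, here is a rough sketch. The schema, the column names, and the two regular expressions are assumptions made for this example; real PII detection needs far more than a pair of patterns.

```python
import re

# Hypothetical schema agreed between the two parties: column -> expected type.
EXPECTED_SCHEMA = {"user_id": int, "amount": float, "note": str}

# Deliberately simple patterns, for illustration only.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def check_schema(record):
    """Report missing, mistyped, and unexpected columns in one record."""
    errors = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if column not in record:
            errors.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            errors.append(f"{column}: expected {expected_type.__name__}")
    for column in record:
        if column not in EXPECTED_SCHEMA:
            errors.append(f"unexpected column: {column}")
    return errors


def check_no_pii(record):
    """Flag string values that look like an email address or a US SSN."""
    errors = []
    for column, value in record.items():
        if isinstance(value, str) and (EMAIL_RE.search(value) or SSN_RE.search(value)):
            errors.append(f"{column}: possible PII")
    return errors


def validate(records):
    """Return (row_index, message) pairs; an empty list means the batch passed."""
    problems = []
    for i, record in enumerate(records):
        for message in check_schema(record) + check_no_pii(record):
            problems.append((i, message))
    return problems
```

Calling `validate(batch)` before anything leaves the sender's environment gives a list of problems to fix, and the same schema definition could be shared with the receiver as a written-down requirement.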

Versioning is difficult

When customers interact with multiple platform companies, each one might get a different version of the data. It’s hard to keep track of the source of truth, affecting the fairness of the evaluation. Which company has which data? This lack of versioning also makes it difficult to manage audits and compliance.

Properties of a new solution

Is there a better way to go about data exchange with business partners? Below, I explore some of the properties of a new solution.

Interoperability

A cloud-agnostic solution that works for any two peers means platform companies and their customers focus on the task at hand.

Auditability and consistency

Making it simple to share a single version of the data set with all platforms ensures a fair evaluation. Further, managing a ledger of all transactions takes the challenge out of audits.
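One way to picture such a ledger is to fingerprint each data set with a content hash and record which recipient received which fingerprint. The sketch below is an assumption about how this could work, not a description of an existing product; `TransferLedger` and the recipient names are hypothetical, and the ledger lives only in memory.

```python
import hashlib
import json


def fingerprint(records):
    """Hash a canonical JSON form of the records so equal data gives equal digests."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TransferLedger:
    """Append-only, in-memory record of which recipient got which data version."""

    def __init__(self):
        self._entries = []

    def record(self, recipient, records):
        """Log one transfer and return the ledger entry that was added."""
        entry = {"recipient": recipient, "version": fingerprint(records)}
        self._entries.append(entry)
        return entry

    def versions_by_recipient(self):
        """Map each recipient to the most recent version it received."""
        return {e["recipient"]: e["version"] for e in self._entries}

    def all_same_version(self):
        """True if every logged transfer carried the same data version."""
        return len({e["version"] for e in self._entries}) <= 1


ledger = TransferLedger()
data = [{"id": 1, "label": "fraud"}]
ledger.record("platform_a", data)
ledger.record("platform_b", data)
print(ledger.all_same_version())
```

Sorting keys makes the fingerprint independent of column order within a row, but row order still matters here; a real tool would need to decide whether two orderings of the same rows count as the same version.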

Security and compliance

Staying secure and compliant is easy when you have the right tools. Handling permissions and processes via toggles that enable masking and encryption allows companies to share data securely and quickly.
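To sketch what a per-field toggle might look like, the example below applies a small policy to each record before it is shared. The policy format, column names, and salt handling are all assumptions for illustration. Note that the masking shown is a salted hash (a form of pseudonymization), not encryption, and whether it satisfies a given regulation is a separate question.

```python
import hashlib

# Hypothetical per-column policy: "keep", "mask", or "drop".
POLICY = {"email": "mask", "ssn": "drop", "amount": "keep"}


def mask_value(value, salt):
    """Replace a value with a short salted SHA-256 digest (pseudonymization)."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:16]


def apply_policy(record, policy, salt):
    """Return a copy of the record that is safe to share under the policy."""
    shared = {}
    for column, value in record.items():
        # Columns without an explicit policy are dropped (deny by default).
        action = policy.get(column, "drop")
        if action == "keep":
            shared[column] = value
        elif action == "mask":
            shared[column] = mask_value(value, salt)
    return shared


record = {"email": "a@b.com", "ssn": "123-45-6789", "amount": 10.0, "extra": "x"}
shared = apply_policy(record, POLICY, salt="example-salt")
print(sorted(shared))
```

Dropping unlisted columns by default means a newly added field is never shared until someone makes an explicit decision about it.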

Data validation and testing

Both sides of the transfer can run tests to validate the data in flight. Customers need to ensure that they are not mistakenly transferring private fields. Invoking automated tests that align with the various types of legislation is invaluable. Further, platform companies may have certain requirements about the data being shared. They too can create tests to ensure the conditions on customer data are met.
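On the receiving side, requirements such as weekday coverage for forecasting or category coverage for fraud detection can be expressed as small checks. This is a minimal sketch; the function names and the category labels are made up for the example.

```python
from datetime import date

WEEKDAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def missing_weekdays(dates):
    """Return the names of weekdays that never appear in the given dates."""
    seen = {d.weekday() for d in dates}  # Monday is 0, Sunday is 6
    return [WEEKDAY_NAMES[i] for i in range(7) if i not in seen]


def missing_categories(labels, required):
    """Return required categories that are absent from the labels, sorted."""
    return sorted(set(required) - set(labels))


# Hypothetical sample: five consecutive days starting Monday 2021-05-03.
sample_dates = [date(2021, 5, 3 + i) for i in range(5)]
print(missing_weekdays(sample_dates))

# Hypothetical fraud categories a platform might ask for.
required = ["account_takeover", "card_not_present", "chargeback"]
print(missing_categories(["chargeback", "card_not_present"], required))
```

A platform company could publish checks like these alongside its data requirements, so customers can run them before sending anything.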

Conclusion

Unlike traditional software applications, evaluating and testing ML systems relies on quality data and lots of it. In every evaluation, much time is spent in bringing data to the platform. We aim to help customers move from manual, script-driven transfers to a solution that is faster and safer.

The public and private clouds aren’t connected. Let’s build some pipelines.

And it’s not just ML companies facing such challenges. Data transfer with business partners and across organizations is an underserved problem and impacts every single industry. In a future post, we will discuss in detail the more general problem of cross-organization data transfer and sharing.

PS Would you like to collaborate? Please get in touch via email and follow our progress on Twitter!

PPS Check out James Le’s thorough overview of this year’s apply(meetup).
