(Data ∩ Water) Terms
All at sea, exploring the data dictionary
Data toolmakers and the larger data community have long been reclaiming water terms to refer to products, services, actions, and concepts.
There is a wealth of discussion about whether data is more like oil or water. Most recently data has been likened to electricity. These surface-level disagreements are water under the bridge, because like the fish in the sea, water terms are aplenty in the data vernacular.
To explore this further, I have gathered a list of fifty-eight terms shared between the two domains. The list is sorted alphabetically. Let’s dive in.
Started and open-sourced at Lyft, Amundsen is a data discovery tool. It provides a powerful search experience by indexing all data resources and performing page-rank style search based on usage metrics.
Stemma is a venture-backed, fully managed data catalog system built atop Amundsen.
In stream processing, backpressure is a suite of methods to prevent or manage data loss when the rate at which data arrives is higher than the rate at which it can be processed.
A popular implementation of these methods can be found in Akka Streams. For a detailed definition, see this post on methods to resist the flow of data in software by Jay Phelps.
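One of the simplest backpressure strategies is a bounded buffer: when the buffer is full, the producer is made to wait instead of dropping data. The sketch below illustrates the idea with Python's standard library only. It is not how Akka Streams implements it, and the names `run_pipeline`, `n_items`, and `buffer_size` are made up for this example.

```python
import queue
import threading
import time


def run_pipeline(n_items, buffer_size):
    """Push n_items through a bounded buffer to a deliberately slow consumer."""
    buffer = queue.Queue(maxsize=buffer_size)  # the bound is what creates backpressure
    processed = []
    sentinel = None  # marks the end of the stream

    def producer():
        for i in range(n_items):
            # put() blocks while the buffer is full, so the fast producer
            # is slowed to the consumer's pace and nothing is lost.
            buffer.put(i)
        buffer.put(sentinel)

    def consumer():
        while True:
            item = buffer.get()
            if item is sentinel:
                break
            time.sleep(0.001)  # simulate processing that is slower than arrival
            processed.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


processed = run_pipeline(n_items=20, buffer_size=2)
```

Blocking the producer is only one option. Other strategies drop or sample items once the buffer fills up, trading completeness for latency.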
Bigeye (fka Toro) is a tool for monitoring data health. Bigeye connects to the data warehouse and automatically builds health metrics with anomaly thresholds for tables selected by customers. With actionable alerts, customers are able to act on fixing errors in the data pipeline.
Boiling Data offers fast SQL queries on an S3 data lake. Data is not moved but queried in place on AWS S3, using embedded DuckDB or SQLite3 instances running on AWS Lambda.
Apache Calcite is a Java-based framework for query parsing, validation, conversion, processing, and optimization. The optimizer is extendible with rules and cost functions.
Calcite has adapters for processing queries on Cassandra, Elasticsearch, Kafka, PostgreSQL, and more. Further, Apache Hive, Apache Solr, Apache Flink, Apache Drill, and Cascading are all powered by Calcite.
Originally created by Chris K Wensel in 2008, Cascading is an open-source framework for distributed data processing, data integration, query planning, and scheduling. Written in Java, Cascading was built to run on a Hadoop cluster, abstracting away the complexities of MapReduce. Scalding, which we cover later in this list, is a Scala DSL for Cascading.
Data cleaning is the series of steps taken to prepare data for analysis, visualization, and machine learning. Data cleaning is a non-negligible and non-trivial part of the data science workflow. It should be noted that the data collection and modeling steps greatly impact the level of data cleaning efforts necessary to make use of data. As such, a common reason for failure in data projects has to do with the lack of involvement of the data team in the data collection and modeling steps.
The concept of the cloud—invented in the 1960s by J.C.R. Licklider—is the ability to access data anywhere and any time. Cloud storage is distributed but federated, fault-tolerant, durable, and eventually consistent. According to Statista, as of 2020, 50% of corporate data is stored in the cloud.
An open-source library by LinkedIn, Coral translates HiveQL views to make them accessible in Trino (fka Presto), Spark, or Pig. More details here.
A service from Amazon, Lake Formation allows for the creation and management of a data lake. It enables data integrations from various sources, data processing via monitored ETL jobs, data governance and security management, and data catalog management. See a tutorial by Alessandro Gaggia here.
Similar to Iceberg and Hudi, Delta Lake is an open data format that enables ACID transactions on a data lake. Delta Lake was created at Databricks—the team behind Apache Spark.
Delta Engine is a query engine by Databricks that uses a proprietary version of Delta Lake with improved read and write performance.
Collecting data with no governance, no quality control, and no oversight creates a deluge. Organizations can avoid a deluge by prioritizing investments in data teams and infrastructure early in their lifetime.
1 In machine learning, concept drift is when the relationship between the input and output variables of a model changes over time. A concept is defined as the joint probability distribution of a model’s input, X, and output, Y, expressed as
P(X, Y) = P(X)P(Y|X).
Concept drift happens when this joint distribution changes, usually due to a change in the posterior probability, P(Y|X).
2 Data drift is when the distribution of the input data, X, changes over time. The change in P(X) is also known as covariate shift.
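To make the second sense concrete, here is a minimal sketch that compares the empirical distribution of one categorical input feature at training time with the same feature in production. Total variation distance is only one of many possible measures, and the feature values, sample sizes, and function names are assumptions made for illustration.

```python
from collections import Counter


def empirical_distribution(values):
    """Estimate P(X) for a categorical feature from observed values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical samples of a single input feature, X = device type.
training = ["mobile"] * 50 + ["desktop"] * 50
production = ["mobile"] * 80 + ["desktop"] * 20

drift = total_variation_distance(
    empirical_distribution(training),
    empirical_distribution(production),
)
```

Here `drift` comes out at roughly 0.3, meaning P(X) has shifted noticeably between the two samples. Whether that is large enough to act on depends on the model and is a judgment call. Note that this check says nothing about concept drift, since it never looks at Y.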
Apache Drill is an open-source SQL engine written in Java. It is partly inspired by Google’s Dremel, a service used in BigQuery. Drill provides an MPP execution engine. This implies that like Trino and unlike Hive, Drill does not depend on MapReduce. Nevertheless, Drill can read from HDFS—among other data storage systems, including HBase, Amazon S3, Google Cloud Storage, and MongoDB.
MapR included Drill as part of their Hadoop bundle. HPE acquired MapR and later dropped support for Drill.
“Drowning in data” is a popular phrase, often followed by “starving for insights.” It refers to the phenomenon of overflowing data lakes and swamps that are difficult to manage and tame. With the popularization and availability of data lake technology, organizations have been collecting data without bounds. However, few organizations have been investing in data teams to manage, organize, and make effective use of that data. Thus, organizations are drowning in data but lack access to its benefits.
1 A database dump is a backup of the data, typically in the form of a file of SQL statements or a transaction log, from which the full database can be recovered.
2 A data dump is a large amount of data, structured or unstructured, transferred from one system to another—sometimes across organizations. Data consultants and those working on data projects with external partners know the drill.
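For the first sense, a small sketch using SQLite from Python's standard library shows one common form of dump: a script of SQL statements that recreates the database when replayed. The table and its rows are invented for the example, and other database systems ship their own dump tools with different formats.

```python
import sqlite3

# Build a small source database in memory (hypothetical table and rows).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE rivers (name TEXT, length_km INTEGER)")
source.executemany(
    "INSERT INTO rivers VALUES (?, ?)",
    [("Nile", 6650), ("Amazon", 6400)],
)
source.commit()

# iterdump() yields the SQL statements needed to rebuild the database.
dump_sql = "\n".join(source.iterdump())

# Recover the full database by replaying the dump into an empty one.
restored = sqlite3.connect(":memory:")
restored.executescript(dump_sql)
rows = restored.execute(
    "SELECT name, length_km FROM rivers ORDER BY name"
).fetchall()
```

After the replay, `rows` holds the same two records as the source database.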
Filtering is one of the steps in data cleaning or in data exploration. Filtering is used to exclude particular data entries by way of rules.
Most data libraries and tools have filters as a first-class feature.
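As a rough illustration of rule-based filtering, the sketch below keeps only the records that pass every rule. The records, field names, and rules are hypothetical, and in practice a data library's built-in filter would usually replace the hand-written helper.

```python
# Hypothetical depth readings, some of them unusable.
records = [
    {"station": "A", "depth_m": 12.5},
    {"station": "B", "depth_m": None},   # missing measurement
    {"station": "C", "depth_m": -3.0},   # physically impossible value
    {"station": "D", "depth_m": 40.2},
]

# Each rule returns True when a record should be kept.
# Order matters: the missing-value check runs before the range check.
rules = [
    lambda r: r["depth_m"] is not None,
    lambda r: r["depth_m"] >= 0,
]


def apply_filters(rows, rules):
    """Keep the rows that satisfy all rules; all() stops at the first failing rule."""
    return [row for row in rows if all(rule(row) for rule in rules)]


clean = apply_filters(records, rules)
```

Only stations A and D survive. Keeping the rules in a list makes it easy to add, remove, or log them individually.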
The Twitter Firehose, now available via GNIP and DataSift, is a historical term referring to the Twitter API delivering 100% of the Tweets that match certain criteria. See this page to learn more about the various options available from the company’s Developer Platform.
Fishtown Analytics, newly rebranded as dbt Labs, is the company behind dbt. dbt is an open source platform for managing and deploying SQL code.
Apache Flume is a distributed platform used to ingest streaming data, from various logging services, into HDFS. In this regard, it is similar to Apache Kafka—albeit with differences. The last stable release of Apache Flume is two years old.
Fog computing, coined in 2014 by Cisco, brings compute resources closer to user devices. A departure from cloud computing, it excels in time-sensitive situations or harsh environments with low internet connectivity. Given that storage and processing take place in a LAN or on a local device, fog computing can be more secure than cloud computing.
Amazon Glacier is a low-cost S3 option for long-term backup and archiving.
H2O is an open-source, linearly scalable, ML platform which can ingest data from a variety of sources. H2O models can be deployed as POJO or MOJO.
To hydrate is to attach Tweet metadata to a Tweet ID.
Note that hydration in web development is not a data-specific term. Further, in object-oriented programming, there’s an even broader use for the verb.
Originally developed at Netflix in 2018, Apache Iceberg provides an open table format for the data lake. Akin to Delta Lake and Hudi, Iceberg enforces ACID-compliant transactions on cloud data stores. Iceberg can be used with Spark, Trino, PrestoDB, Flink and Hive. While Iceberg excels at managing huge tables, it does not do as well for deletions or mutations.
The data lake is a central repository of all the data in an organization. The lake stores unstructured, semi-structured, or structured data. The lake usually contains raw copies of the data and perhaps transformed and processed versions thereof. Both of these characteristics stand in contrast to the data warehouse which is designed to store structured and transformed data.
Examples of data lake technology include HDFS (part of Apache Hadoop), Google Cloud Storage, and Amazon S3. See a comparison of HDFS and Google Cloud Storage.
A portmanteau of data lake and data warehouse, ‘lakehouse’ is a term popularized by Databricks in early 2020. The lakehouse paradigm is an evolution of the concept of the data lake, with the goal of mitigating the need to manage both a data lake and a warehouse. This unified approach prevents redundancy, decreases costs, and simplifies governance.
Google BigQuery, Apache Drill, Azure Synapse Analytics, Amazon Athena, and Snowflake can be considered instantiations of the lakehouse concept. With the launch of Delta Lake in 2019, Databricks added a structured transaction layer to its Unified Analytics Platform.
Data leak or data breach is the unwanted extraction and transfer of confidential or sensitive data to an external environment.
Originally open-sourced by Erik Bernhardsson at Spotify, Luigi is a data workflow management tool for Python. In its class, it is second in popularity only to Airflow.
Data mining is an interdisciplinary field comprising the subfields of machine learning, statistics, and data storage and management systems.
1 A mirage is when successful data science projects seem like a pipe dream.
2 The Data Mirage is a book by Ruben Ugarte to help executives and leaders make effective use of data in their organizations.
Nemo is a powerful internal data discovery tool at Facebook. Some open source data discovery platforms are DataHub, Amundsen, Metacat, Marquez, and Apache Atlas. For a thorough overview of the features of a data discovery platform, see this post by Eugene Yan.
Sisense for Cloud Data Teams (fka Periscope Data or Periscope) is business intelligence software that connects to the data warehouse and allows users to create and share interactive dashboards.
Periscope is built with the data scientist in mind. When it comes to data scientist experience in the BI tooling space, Periscope blows the competition out of the water.
Permifrost is an open-source project by Taylor Murphy. Written in Python, it’s used for managing permissions on Snowflake.
Data plumbing is a general term capturing the wide variety of technologies used to handle data in the cloud.
Apache REEF (Retainable Evaluator Execution Framework) is a framework which provides abstractions over resource managers like Apache Hadoop YARN or Apache Mesos. REEF mitigates the need to reimplement common functions in cluster management and scheduling. Microsoft Azure Stream Analytics is built on REEF and Hadoop.
Rill is a fully managed service for Apache Druid. Apache Druid is a distributed datastore for realtime analytics and can ingest data from both streaming (Kafka, Kinesis) and batch sources (HDFS, S3). Unlike Scuba, Druid is column-oriented and allows for historical data analysis and reporting.
This page lists third-party companies providing commercial Druid support.
Building and maintaining data connectors is complex and time-consuming, more so with ever-evolving vendor APIs. Rivery (and similar products like Fivetran and Stitch) provides out-of-the-box integrations for data ingestion, data transformation, and orchestration. These ELT tools automate integrations to more than a hundred databases, SaaS applications, and storage platforms. They also support many destinations, including Google BigQuery, Amazon S3, Azure Cloud Storage, and Snowflake.
Scalding is an open-source Scala DSL for Cascading. Originally built at Twitter by P. Oscar Boykin, Scalding brings the advantages of Scala to MapReduce jobs. Once you get to know the ropes, Scalding is a pleasure to use.
Scuba is a fast and scalable in-memory database for real-time and ad-hoc analysis at Facebook. Aside from a SQL interface (a subset of SQL), Scuba provides an interface for data visualization.
In an organization without a central data function and ownership of data, each department manages their data independently. This distributed ownership creates data silos.
Independent ownership of data seems like an attractive option. However, organizational data managed in silos leads to redundancy. Inconsistency amongst redundant copies of data across silos is common and is a reason for inaccurate analyses. Additionally, siloed data makes it challenging to attain a global view of the business.
Snorkel Flow, the most recent product from Snorkel AI, is a platform for developing, deploying, and managing ML applications. Snorkel’s technology—including programmatic labeling—has been developed at Stanford AI Lab.
Part of the AWS Snow Family suite of products, Snowball is a device for data migration and edge computing that can work in disconnected environments. The Compute Optimized option of Snowball provides 52 vCPUs of compute capacity and 43 TB of object storage, while the Storage Optimized option provides 80 TB of S3-compatible object storage and 40 vCPUs of compute capacity. The Snowball devices can be clustered to build larger systems.
Part of the AWS Snow Family suite of products, Snowcone devices are used to collect and move data to AWS, either by shipping or online. At 4.51lbs, Snowcone is portable and secure and can store up to 8TB of data.
Snowflake is a recently public company (NYSE: SNOW) providing data warehousing as a service. Snowflake can run on AWS, Microsoft Azure, and GCP. Snowflake is SQL-based and ACID-compliant. Due to its auto-scaling and auto-suspend features, Snowflake requires little administration. Snowflake’s virtual warehouses ensure that users and other processes (e.g., ETL pipelines) don’t compete for compute resources.
Part of the AWS Snow Family suite of products, Snowmobile is a shipping container. Equipped with both logical and physical security, it can move up to 100PB of data to load into S3.
Snowpark provides a Scala DSL for SnowSQL. It is currently only available for use with Snowflake accounts hosted on AWS.
Similar to Google Analytics, Snowplow is a marketing and product analytics platform. Its core components are open source, but customers can also sign up for a managed service called Snowplow Insights. See here for a review of the scope of the analytics provided by Snowplow.
Apache Storm is a distributed stream processing library created by Nathan Marz at BackType. BackType was later acquired by Twitter, where Storm was open-sourced. Apache Heron is Storm’s successor at Twitter. Spark streaming and Flink are other popular stream processing frameworks.
1 Streamlit has been making waves in the data science community and living up to its name. Growing out of Google X in 2013, Streamlit is an open-source framework built in React. This ZDNet article draws similarities between Streamlit and BI tools: just as BI tools simplify building web apps based on SQL queries (i.e., dashboards), Streamlit enables building and deploying web apps based on Python scripts. Streamlit works great right out of the box, just add water.
2 Streamlit is a company providing Streamlit support for teams.
An unmanaged and unmonitored data lake is called a data swamp. Data in a data swamp is difficult to access and difficult to use.
In Singer—an open-source data integration framework by Stitch—taps are data extraction scripts. Taps have been implemented for a variety of sources. Further, users can contribute new taps to extend the list of supported integrations in Stitch.
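To give a feel for what a tap does, here is a minimal sketch of a script that emits Singer-style JSON messages (SCHEMA, RECORD, and STATE), one per line. It is a simplified illustration written from the general shape of the Singer specification rather than a production tap. The stream name, the rows, the `rows_emitted` state key, and the function names are all made up for the example.

```python
import json
import sys


def tap_messages(stream, rows, key):
    """Yield Singer-style messages for an in-memory list of rows."""
    # Simplified schema: only the key field is described here.
    schema = {"type": "object", "properties": {key: {"type": "string"}}}
    yield {"type": "SCHEMA", "stream": stream, "schema": schema, "key_properties": [key]}
    for row in rows:
        yield {"type": "RECORD", "stream": stream, "record": row}
    # STATE lets a later run resume where this one stopped (hypothetical bookmark).
    yield {"type": "STATE", "value": {"rows_emitted": len(rows)}}


def run_tap(out=sys.stdout):
    """Write one JSON message per line, the form a downstream target would read."""
    rows = [{"name": "Nile"}, {"name": "Amazon"}]  # stand-in for a real source
    for message in tap_messages("rivers", rows, "name"):
        out.write(json.dumps(message) + "\n")


if __name__ == "__main__":
    run_tap()
```

A real tap would read from an API or database instead of a hard-coded list, and its output would typically be piped into a target that loads the records into a destination.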
Topcoat is a data analytics platform, currently in beta, which allows data scientists not only more control over the visual output but also more visibility into the data lineage than a traditional BI tool.
Founded by three former Airbnb employees who had worked on or with Minerva, Transform is aiming to become a centralized metrics store for organizations. Transform is made up of three products: Metrics Framework for defining metrics using YAML and SQL, Metrics Catalog for data discovery (includes data lineage and lifecycle information), and Metrics API for querying and visualizing metrics via other tools.
In data warehousing, data transformation represents the ‘T’ in ETL or ELT processes. See here for a list of operations performed during the data transformation stage.
With the digital transformation of ever more industries, there is an increased demand for data infrastructure, data skills, and data teams to enable the evolution. This wave of investments is called the data tsunami.
Linstedt’s Data Vault is one of three popular approaches to data modeling—the other two being Inmon’s subject modeling and Kimball’s dimensional modeling. The purpose of Data Vault is to improve the flexibility, auditability, and resiliency of the data warehouse.
Some terms satisfy the criteria of being at the intersection of the data and water domains, yet the corresponding products, services, actions, and concepts are not exclusive to data applications. Other terms satisfy the criteria but are not as recognizable. To avoid watering down the main list while still acknowledging these terms, I include them here.
Digital Ocean (n.)
log (n., v.)
I recently came across this interesting post by Lauren Balik about data tools with octopus logos. Someone also pointed out OctoML. Inkredible! 🐙
This list evolved from a Twitter thread. In no particular order, I would like to thank Taylor Murphy (2)(3), Seth Rosen, Josh Wills, Chris Wensel, Neil Kodner, DJ Patil, Aaron Richter, Neville Li, Herval Freire, Hugo Bowne-Anderson, Carl Anderson, Aaron Gonzales, Matt Dickenson, Ryan Kopa, Varadaraj N Reddy, Ashish Kumar, Wes Turner, Matt Canute, Ali Taheri, Y S Ramakrishna, Chris Harland, and Reid McKenzie for their contributions.
Special thanks to Houston H. Haynes, Tyler Richards, Parham, and Dan Huang for their feedback on this post, and Kurt Bales and Taylor Murphy for encouraging its creation.