Databricks Launches Open Source Delta Sharing Project
Delta Sharing is an open protocol for securely sharing data between organizations in real time, completely independent of the platform on which the data resides.
Databricks made several announcements during this week’s Data + AI Summit. The first was the launch of a new open-source project called Delta Sharing, which the company describes as the world’s first open protocol for securely sharing data between organizations in real time, independent of the platform on which the data resides. Delta Sharing is included in the open-source Delta Lake project and is supported by Databricks and a wide range of data providers, including NASDAQ, ICE, S&P, Precisely, Factset, Foursquare, and SafeGraph, as well as software vendors such as AWS, Microsoft, Google Cloud, and Tableau.
The solution targets a common industry problem. Data sharing has become essential to the digital economy as companies look to exchange data easily and securely with their customers, partners, and suppliers; for example, a retailer sharing timely inventory data with each of the brands it sells. However, data sharing solutions have always been tied to a single vendor or commercial product, which ties data access to proprietary systems and limits collaboration between organizations that use different platforms.
In a call with RTInsights, Joel Minnick, vice president of marketing at Databricks, explained the purpose of Delta Sharing. “What you have now is a proliferation of a bunch of data sharing network silos that allow people to share some of the data with certain people from time to time. And it has been since the 80s, but we always see new entrants in the market, creating new networks for sharing proprietary data.”
He continued, “Our heritage, our roots are still open source. It sounds like a problem that could be solved in a really effective way if we approach it from an open point of view.”
He noted that Delta Sharing addresses two issues. The first is that it is a fully open and secure protocol for sharing data, so it removes proprietary lock-in. The second, and a really big one, is that many of the data sharing networks and tools that exist today were designed to share structured data. That is what they govern, and what they expose is most of the time just a SQL interface.
Minnick noted that the types of data customers want to share these days increasingly tend to be unstructured. For example, companies often want to share images, videos, dashboards, and machine learning models.
Delta Sharing is also designed to support data science. It can provide governance for unstructured data and expose shared data not only through SQL but also through Python, so it can meet the needs of data engineers, data analysts, and data scientists.
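To make the Python access path concrete, here is a minimal sketch that assumes the open-source `delta-sharing` Python connector is installed. The connector addresses a shared table with a URL of the form `<profile-file>#<share>.<schema>.<table>`. The profile file name and the share, schema, and table names below are hypothetical placeholders, not values from the announcement.

```python
def build_table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the table URL format the delta-sharing connector expects:
    <profile-file>#<share>.<schema>.<table>
    """
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path: str, share: str, schema: str, table: str):
    """Load a shared table into a pandas DataFrame.

    Assumes `pip install delta-sharing` and a valid profile file issued
    by the data provider; the import is deferred so the URL helper above
    can be used without the connector installed.
    """
    import delta_sharing

    return delta_sharing.load_as_pandas(
        build_table_url(profile_path, share, schema, table)
    )


# Hypothetical names, for illustration only.
url = build_table_url("config.share", "retail", "inventory", "stock_levels")
print(url)
```

A recipient would call `load_shared_table(...)` with the profile file received from the provider; no Databricks account is required on the recipient side, which is the point of the open protocol.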
These points were emphasized during the announcement. “The main challenge for data providers today is to make their data easily and widely usable. Managing dozens of different data delivery solutions to reach all user platforms is untenable. An open and interoperable standard for real-time data sharing will dramatically improve the experience for data providers and data users,” said Matei Zaharia, chief technologist and co-founder of Databricks. “Delta Sharing will standardize the way data is securely exchanged between businesses, no matter what storage or computing platform they use, and we’re excited to make this innovation open source.”
The bottom line is that Delta Sharing extends the applicability of the Lakehouse architecture that organizations are rapidly adopting today, as it enables an open, simple, and collaborative approach to data and AI within and now between organizations.
Improved data management
Also at the summit, Databricks announced two improvements to the data management of its Lakehouse platform. They include:
Delta Live Tables is a cloud service on the Databricks platform that makes ETL (extract, transform, and load) easy and reliable on Delta Lake, helping ensure data is clean and consistent when used for analysis and machine learning.
Unity Catalog simplifies data and AI governance across multiple cloud platforms. It is based on the ANSI SQL standard to streamline implementation and standardize governance across clouds. Unity Catalog also integrates with existing data catalogs, so organizations can build on what they already have and establish a future-proof, centralized governance model without costly migrations.
Bringing together data and machine learning
The company also announced the expansion of its machine learning (ML) offering with the launch of Databricks Machine Learning, a new, purpose-built platform that includes two new features: Databricks AutoML, which speeds up model building without sacrificing control and transparency, and Databricks Feature Store, which improves the discoverability, governance, and reliability of model features.
With Databricks Machine Learning, new and existing ML capabilities on the Databricks Lakehouse platform are organized into a collaborative, role-based product surface that provides ML engineers with everything they need to build, train, deploy, and manage ML models from experimentation to production, uniquely combining data and the entire ML lifecycle.
Databricks Feature Store streamlines ML at scale with easy feature sharing and discovery. Machine learning models are built using features, which are the attributes that a model uses to make a decision. Feature Store makes it easy for data teams to reuse features across different models, avoiding rework and duplicated features, which can save months when developing new models. Features are stored in Delta Lake’s open format and can be accessed through Delta Lake’s native APIs.
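The idea of feature reuse can be illustrated with a small, self-contained sketch. This is not the Databricks Feature Store API; `FeatureRegistry` and the feature names are hypothetical, and the example only shows how a feature defined once can feed several models.

```python
from typing import Callable, Dict, List

# A feature is a named function that turns a raw record into a number.
FeatureFn = Callable[[dict], float]


class FeatureRegistry:
    """Toy in-memory registry (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureFn] = {}

    def register(self, name: str, fn: FeatureFn) -> None:
        # Refuse duplicates so two teams cannot silently redefine a feature.
        if name in self._features:
            raise ValueError(f"feature already registered: {name}")
        self._features[name] = fn

    def names(self) -> List[str]:
        # Discovery: list what is already available before building anew.
        return sorted(self._features)

    def build_vector(self, record: dict, names: List[str]) -> List[float]:
        # Compute the requested features for one record, in the given order.
        return [self._features[n](record) for n in names]


registry = FeatureRegistry()
registry.register("basket_total", lambda r: float(sum(r["prices"])))
registry.register("item_count", lambda r: float(len(r["prices"])))

record = {"prices": [2.5, 4.0, 1.5]}

# Two different models reuse the same "basket_total" definition.
churn_inputs = registry.build_vector(record, ["basket_total", "item_count"])
upsell_inputs = registry.build_vector(record, ["basket_total"])
print(churn_inputs, upsell_inputs)
```

Because both models read `basket_total` from one definition, a fix or improvement to that feature reaches every model that uses it.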