First International Workshop on Composable Data Management Systems, 2022

Workshop Venue: Sydney, Australia - Co-located with VLDB 2022

Workshop Date: 9th September 2022


Invited Talks


Title: Velox: Unifying execution engines across Meta and beyond
Speaker: Pedro Pedreira, Software Engineer, Meta Platforms Inc.

Speaker Bio: Pedro Pedreira has worked as a Software Engineer at Meta for the last decade, focusing on Data Infrastructure. Currently, Pedro leads the Velox program, a major effort to unify execution engines into a single open-source library, spanning more than a dozen engines within Meta and beyond. In the past, he worked on log analytics engines (such as Scuba) and on Cubrick, an in-memory analytical DBMS he and his team developed from the ground up based on multidimensional indexing ideas proposed in his PhD thesis. Pedro holds a PhD and an MS in Computer Science from the Federal University of Paraná, in Brazil.

Abstract: Velox is a novel open-source C++ database acceleration library created by Meta. Velox provides reusable, extensible, high-performance, engine- and dialect-agnostic data processing components for building execution engines and enhancing data management systems. The library is currently integrated, or being integrated, with more than a dozen data systems at Meta, including analytical query engines such as Presto and Spark, stream processing platforms, message buses and data warehouse ingestion infrastructure, machine learning systems for feature engineering and data wrangling (PyTorch), and more. This talk describes the main motivations and value proposition behind Velox, its principal use cases, and an outline of the library and its key optimizations.
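
The sketch below (purely illustrative: Velox itself is a C++ library, and none of the class or function names here come from its actual API) shows the composability idea the abstract describes: one engine-agnostic execution component, such as a vectorized function library, implemented once and embedded by several engine frontends.

    # Illustrative Python sketch only; not the Velox API. It shows a single,
    # reusable, engine-agnostic execution component shared by two hypothetical
    # engine frontends instead of each re-implementing its own function library.
    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class ColumnBatch:
        """A minimal stand-in for a columnar vector batch."""
        columns: Dict[str, List[float]]

    class SharedExpressionEvaluator:
        """One expression-evaluation component reused by every engine frontend."""

        def __init__(self) -> None:
            self._functions: Dict[str, Callable[[float], float]] = {}

        def register_function(self, name: str, fn: Callable[[float], float]) -> None:
            # Registered once, behaves identically in every engine that embeds it.
            self._functions[name] = fn

        def eval(self, batch: ColumnBatch, column: str, function: str) -> List[float]:
            fn = self._functions[function]
            return [fn(v) for v in batch.columns[column]]

    # Two hypothetical engines (say, a batch SQL engine and a streaming engine)
    # embed the same evaluator and therefore get identical function semantics.
    shared = SharedExpressionEvaluator()
    shared.register_function("double", lambda v: v * 2.0)

    batch = ColumnBatch(columns={"price": [1.5, 2.0, 3.25]})
    print(shared.eval(batch, "price", "double"))  # [3.0, 4.0, 6.5] in both engines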



 
Title: SAP’s global data mesh for composable data management supporting cross-domain consumption
Speakers: Anil K Goel, CTO / Head of Technology Office, SAP HANA Database and Analytics & Mihnea Andrei, Executive Architect, SAP HANA Database and Analytics

Speaker Bios:
Anil K Goel is CTO / Head of Technology Office for SAP HANA Database and Analytics at SAP, where he works with the globally distributed data management and analytics engineering teams to drive forward-looking architectures, vision, strategy, research, and pathfinding. He also oversees data management and analytics related collaborative research and internship programs with many universities globally. His interests include database system architecture, in-memory and large-scale distributed computing, self-management of software systems, and cost modelling. Anil earned a PhD in computer science from the University of Waterloo. He holds an MTech (CS) from the Indian Institute of Technology, Delhi, and a BE (Electronics and Communications Engineering) from the University of Delhi.

Mihnea Andrei is an Executive Architect at SAP HANA Database & Analytics, working within the Technology Office and Database organizations on data processing technology. His current focus is SAP HANA Cloud, its cloud qualities and cloud-native database architecture patterns, and the SAP Data Plane, the globally distributed data mesh that supports SAP's data integration and uses HANA Cloud as its data processing backbone. During his career, Mihnea has worked on the core engines of several database management systems (SAP HANA, Sybase ASE, and Sybase IQ), covering most engine areas, e.g. query optimization and execution, database stores (in-memory and on-disk, row- and column-oriented), and transaction processing. He has co-authored a number of database publications reflecting this work at conferences including SIGMOD and VLDB. Mihnea holds a DEA in computer science from Université Paris 6 and an MS in computer science from the Bucharest Polytechnic Institute.

Abstract: SAP delivers a broad set of enterprise cloud applications that cut across a large variety of industry domains. SAP envisions supporting modern businesses by composing its portfolio of applications into an integrated offering called the Intelligent Enterprise Suite. This vision requires cross-domain integration of processes as well as integration and consumption of data and semantics. To achieve this ambitious undertaking, we are implementing a global data mesh built upon SAP HANA Cloud, a holistic multi-modal cloud data management suite that is also available for direct consumption by customers. This talk describes the motivations for the data mesh, how it supports Domain-Driven Design, its architecture, and the specific composable data management topics we have encountered in elaborating and then implementing it.



Title: Coral: A SQL translation and rewrite engine for modern data lakes
Speaker: Walaa Eldin Moustafa, Senior Staff Software Engineer, LinkedIn

Speaker Bio: Walaa Eldin Moustafa is a Senior Staff Software Engineer at LinkedIn, where he works on building big data infrastructure and solutions that enable unified and performant data processing across different compute engines, storage representations, and language APIs. Walaa holds a PhD in Computer Science from the University of Maryland, College Park. He has co-authored a number of database publications at conferences including SIGMOD, ICDE, and IEEE Big Data, on topics spanning modern applications of relational and deductive database management systems such as graph query processing, machine learning, data integration, and probabilistic databases.

Abstract: In this talk, we present Coral, a framework for achieving logic interoperability between SQL engines. Coral defines a standard intermediate representation (IR) that is used to express different SQL dialects, relational languages, and query plans. Coral implements a set of adapters that convert between various input and output representations by mapping their semantics to and from the Coral IR. Currently, Coral supports a number of conversions between the Hive, Spark, and Trino (or Presto) dialects of SQL. Further, we discuss Coral applications such as making views portable across execution engines, and its future extensions to support common query optimizations such as materialized view selection and substitution, SQL pushdown, and incremental compute.
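
As a rough illustration of the adapter pattern described above, here is a toy Python sketch (Coral itself is a Java framework; every class and method name below is hypothetical) in which a Hive query is mapped into a shared IR and then rendered in another dialect.

    # Illustrative sketch only; not Coral's actual API. It mimics the pattern the
    # abstract describes: each dialect gets an adapter to and from a shared IR.
    from dataclasses import dataclass
    from typing import Dict

    @dataclass
    class CoralIRNode:
        """Toy stand-in for a node in the shared relational IR."""
        op: str                 # e.g. "project", "filter", "scan"
        args: Dict[str, str]

    class HiveToIR:
        def convert(self, hive_sql: str) -> CoralIRNode:
            # Real adapters parse and analyze the query; here we fake a trivial scan.
            table = hive_sql.split("FROM")[-1].strip().rstrip(";")
            return CoralIRNode(op="scan", args={"table": table})

    class IRToTrino:
        def convert(self, node: CoralIRNode) -> str:
            # Emit the equivalent query in the target dialect.
            return f'SELECT * FROM {node.args["table"]}'

    # A Hive view definition becomes portable: translate it once into the IR,
    # then render it for whichever engine needs to execute it.
    ir = HiveToIR().convert("SELECT * FROM db.events;")
    print(IRToTrino().convert(ir))  # SELECT * FROM db.events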



Title: The Single Product Approach: Supporting Diverse Workloads with Snowflake's Query Engine
Speaker: Jiaqi Yan, Principal Software Engineer, Snowflake Computing

Speaker Bio: Jiaqi Yan is a Software Engineer at Snowflake, where he leads development efforts in Snowflake Database's Query Compiler and Workload Optimization features. Prior to joining Snowflake, Jiaqi worked on Oracle's Query Optimizer.

Abstract: Today, cloud providers are constantly launching new services, pushing an ever-growing level of complexity onto customers. Different products have different learning curves, name similar concepts in different ways, and come with different security models and different pricing and business models. This offers a level of agility to providers, but it comes at a cost: customers must sort through multiple options and piece solutions together themselves. In this talk, I will discuss Snowflake's Single Product Principle, how it helps simplify the customer experience, and how it differs from simple integrations across multiple products.



Title: GoogleSQL: A SQL Language as a Component
Speaker: David Wilhite, Senior Staff Software Engineer and Manager, Google

Speaker Bio: David Wilhite has worked as a Software Engineer at Google for 10 years, where he currently leads a team building componentized infrastructure for large-scale analytics use cases, which includes the GoogleSQL component. Prior to joining Google, David worked across the DBMS stack (compiler, optimizer, execution) at various DBMS companies including Red Brick Systems and ParAccel.

Abstract: Google has built many systems using SQL: public cloud products like BigQuery and Spanner, internal systems like F1 Query and Procella, open source ZetaSQL, and other systems that use SQL as a declarative API for data processing and management outside traditional SQL databases and query engines. Building GoogleSQL as a component has enabled easy reuse across many systems, with full language compatibility.

Sharing a language definition and parser alone is not sufficient to achieve real consistency or easy reuse. The GoogleSQL libraries include shared language analysis, shared implementations of common functions and operators, a compliance testing framework for validating consistency across engines, and other shared subcomponents.

The consistency that results from sharing the language and implementation across tools has enabled the growth of a large ecosystem of interoperable tools that can easily leverage the full power of SQL, while still allowing domain-specific extensions and customizability.
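
To make the compliance-testing idea concrete, here is a toy Python sketch (none of these names come from GoogleSQL or ZetaSQL): a shared function library is embedded by two hypothetical engines, and a small harness checks that both return identical results on the same cases.

    # Illustrative sketch only; not GoogleSQL's actual infrastructure. It shows
    # the abstract's point that a shared implementation plus cross-engine
    # compliance checks, not just a shared grammar, is what yields consistency.
    from typing import Callable, Dict, List

    # A shared function library, implemented once and embedded by every engine.
    SHARED_FUNCTIONS: Dict[str, Callable[[List[int]], int]] = {
        "SUM": sum,
        "MAX": max,
    }

    def engine_a(function: str, values: List[int]) -> int:
        return SHARED_FUNCTIONS[function](values)

    def engine_b(function: str, values: List[int]) -> int:
        return SHARED_FUNCTIONS[function](values)

    def compliance_check(cases: List[tuple]) -> None:
        """Run every case through both engines and require identical results."""
        for function, values in cases:
            a, b = engine_a(function, values), engine_b(function, values)
            assert a == b, f"{function}({values}): {a} != {b}"

    compliance_check([("SUM", [1, 2, 3]), ("MAX", [4, 7, 2])])
    print("engines agree on all compliance cases")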



Title: Delta Lake: The one storage format for all your analytics
Speaker: Tathagata Das, Staff Software Engineer, Databricks, Apache Spark PMC

Speaker Bio: Tathagata Das is a Staff Software Engineer at Databricks and a member of the Apache Spark Project Management Committee (PMC). He has been involved with the Apache Spark project for the last 12 years. He developed the original Spark Streaming (DStreams) during his grad student days at the AMPLab, UC Berkeley. He was one of the core developers of Structured Streaming and, since 2018, one of the core developers of the Delta Lake product. Currently, he leads the development of the Delta Lake open source project and the ecosystem around it.

Abstract: Delta Lake is an open-source storage format that brings ACID transactions to big data workloads on cloud object stores. Traditionally, there have been two kinds of data management frameworks:

  • Databases that provide transactional guarantees but are not scalable in a cost-effective way
  • Data lakes that provide cost-effective scalability, but limited transactional and data quality guarantees

The Delta Lake format combines the best of both worlds for analytical workloads:

  • Scales cost-efficiently by storing metadata as data - both data and metadata scale to tens of terabytes
  • Provides data quality guarantees - ACID transactions, schema enforcement, data constraints
  • Simplifies modern workloads on ever-changing data - schema evolution when appending or upserting data

Delta Lake allows you to store data on storage systems like HDFS, S3, Azure Data Lake Storage, and GCS, to query it from many processing engines including Spark, PrestoDB, Trino, Hive, and Flink, and provides APIs for Scala, Java, Python, Rust, and Ruby. In this talk we are going to discuss the following (a minimal usage sketch in Python follows the outline):

  • Architecture - Why does it scale so well?
  • Features - What makes it so unique?
  • Roadmap - What is in the future?
  • Connector ecosystem - What can you read / write from?
  • Community - How to contribute and engage?
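
As a concrete illustration of the Python API mentioned above, the following minimal PySpark sketch writes and reads a Delta table. It is not taken from the talk; it assumes the delta-spark pip package, a local Spark installation, and a hypothetical /tmp table path.

    # Minimal PySpark sketch (assumes the delta-spark pip package and a local
    # Spark install; the table path below is hypothetical).
    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    builder = (
        SparkSession.builder.appName("delta-demo")
        # Register Delta Lake's SQL extension and catalog with this Spark session.
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    path = "/tmp/delta/events"  # could equally be an s3://, abfss://, or gs:// URI

    # Writes are ACID: concurrent readers never observe a half-written table version.
    spark.range(0, 5).withColumnRenamed("id", "event_id") \
        .write.format("delta").mode("overwrite").save(path)

    # Schema enforcement rejects mismatched appends unless schema evolution is
    # explicitly requested with mergeSchema (here a new 'source' column is added).
    new_rows = spark.range(5, 10).withColumnRenamed("id", "event_id") \
        .withColumn("source", lit("web"))
    new_rows.write.format("delta").mode("append") \
        .option("mergeSchema", "true").save(path)

    # Any Delta-aware engine (Spark, Trino, Flink, ...) can read the same table.
    spark.read.format("delta").load(path).show()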