First International Workshop on Composable Data Management Systems, 2022

Workshop Venue: Sydney, Australia - Co-located with VLDB 2022

Workshop Date: 9th September 2022


Keynote Talks


Title: DuckDB - A Modern Modular & Extensible Database System
Speaker: Mark Raasveldt, CTO, DuckDB Labs & Postdoctoral Researcher, CWI

Speaker Bio: Mark Raasveldt is the co-founder of DuckDB Labs, where he currently works as CTO and lead developer on the DuckDB database system. He is also working as a postdoc in the Database Architectures group at the Centrum Wiskunde & Informatica (CWI). Mark did his PhD at the CWI, working on efficient integration of machine learning and analytics programs with relational database management systems.

Abstract: DuckDB is a fast analytical embedded database system that has been designed with modularity and extensibility in mind. For users, it is important that DuckDB can efficiently interface directly with many different storage formats, and that they can easily extend DuckDB with their own functionality. Many users are also interested in using only parts of the system, for example the execution engine or the parsing layer. For researchers working with DuckDB, it is important that they can easily extend the system at various layers with their own algorithms. In this talk I will discuss the modularity and extensibility of the DuckDB system, and the various system design decisions we have made to allow for this.
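As a minimal illustration of the kind of interfaces the abstract refers to (a sketch for this program, not material from the talk), the snippet below uses DuckDB's Python API to query a Parquet file in place and to load an optional extension; the file name is a hypothetical placeholder.

    import duckdb  # embedded, in-process analytical database

    con = duckdb.connect()  # in-memory database, no server process needed

    # Interface directly with external storage formats: Parquet files can be
    # scanned in place, without a separate import step ('events.parquet' is a
    # hypothetical example file).
    con.execute("SELECT count(*) FROM 'events.parquet'").fetchall()

    # Extensibility: optional functionality ships as loadable extensions,
    # e.g. httpfs for reading files over HTTP or S3.
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")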



Title: Shared, High-Performance Software Components for Shared, High-Performance Hardware
Speaker: Xiaosong Ma, Principal Scientist, Qatar Computing Research Institute, HBKU

Speaker Bio: Xiaosong Ma is currently a Principal Scientist at Qatar Computing Research Institute. Her research interests are in the areas of graph systems, distributed/cloud computing, and storage systems. Xiaosong has published over 100 research papers and currently serves on the editorial board of the ACM Transactions on Storage. She received both the DOE Early Career Principal Investigator Award and the NSF CAREER Award, and was named a University of Illinois Department of Computer Science Alumni Distinguished Educator and an ACM Distinguished Member. Xiaosong received her Ph.D. from the University of Illinois at Urbana-Champaign in 2003, and her B.S. from Peking University in 1997.

Abstract: Today, clouds and datacenters have become the default platforms for many applications, sharing hardware resources at different levels. Though current-generation processors, networks, and storage are increasingly designed for such sharing, individually optimized applications do not easily achieve predictable performance or good overall resource utilization. In this talk, I will share my thoughts on shared software components designed to run on shared hardware and discuss several related systems developed in our research.



Title: Substrait: Rethinking DBMS Composability
Speaker: Jacques Nadeau, Co-founder & CEO, Sundeck

Speaker Bio: Jacques Nadeau is co-founder & CEO of Sundeck. Previously, he was CTO and co-founder of Dremio. Jacques is co-creator and founding PMC chair of Apache Arrow, co-creator of Substrait and an Apache Calcite PMC member.

Abstract: We will give an overview of Substrait, a new (started September 2021) Apache-licensed open source effort to develop a standardized way to represent compute operations and query plans. We'll start by talking about the dynamics that led to the creation of Substrait (including the inspirations drawn from LLVM IR and JVM bytecode). We'll then give an overview of how Substrait primitives allow common plan patterns to be expressed, and discuss Substrait's model of disciplined extensibility. Lastly, we'll cover the large number of active Substrait integration efforts, including those underway in Ibis, Arrow, Velox, Presto, Spark, and DuckDB.
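As a rough sketch of the interchange the abstract describes (an illustration for this program, not material from the talk), the snippet below assumes DuckDB's substrait extension and its get_substrait_json table function are available; it serializes the plan of a SQL query into Substrait's JSON form, which another Substrait consumer could in principle execute.

    import duckdb

    con = duckdb.connect()
    # Assumes the substrait extension can be installed and loaded in this build.
    con.execute("INSTALL substrait")
    con.execute("LOAD substrait")

    con.execute("CREATE TABLE t AS SELECT range AS x FROM range(10)")

    # Produce a Substrait plan (JSON form) for a query; get_substrait()
    # returns the equivalent binary protobuf instead.
    plan_json = con.execute(
        "CALL get_substrait_json('SELECT x FROM t WHERE x > 5')"
    ).fetchone()[0]
    print(plan_json)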



Title: Evolving an Organization's Data Management Discipline in support of their Machine Learning Activities
Speaker: David Cohen, Senior Principal Engineer, Intel

Speaker Bio: Dave is a Senior Principal Engineer in Intel's Data Center and Artificial Intelligence (DCAI) business unit. He focuses on large-scale data management challenges faced by Intel's Cloud customers. Prior to Intel, David was a Director in the Office of the CTO at EMC, where he led efforts related to integrating storage systems with network virtualization. David also has a long history of building distributed systems in industry, most recently for the investment banks Goldman Sachs and Merrill Lynch. An experienced practitioner, Dave's active connections to commercial and academic research and development labs ensure Intel's Data Management Solutions are both well-grounded and cutting-edge. An acknowledged industry expert in system architecture and development, Dave is a sought-after speaker and published author.

Abstract: Since the inception of the World Wide Web, companies have been mining web-server logs to analyze user interactions. The rise of smartphones, their video and photo capabilities, and smartphone-resident applications continues to drive year-on-year data growth at phenomenal rates. By the mid-2010s, Google observed that the data generated annually by its YouTube service was growing faster than annual improvements in CPU performance. By 2016, the industry had begun to employ offload accelerators (e.g., GPUs, TPUs) to augment CPU processing capabilities in order to keep up with the deluge of incoming data. It has been noted that data growth today continues to outstrip performance improvements in the computational plant, even after incorporating offload accelerators.

Over the same period, companies such as Alibaba, Amazon, Google, Meta, Microsoft, and others have aggressively adopted, and innovated with, machine learning. These companies have seen an explosion in the dataset sizes used to train their models, as well as in the frequency of decisions the models produce in the serving environment. The number of parameters in a single model has grown into the billions-to-trillions range. Models of this size consume significant amounts of infrastructure resources at training and serving time, and training such a model can take weeks to complete.

These data growth and machine learning trends are reshaping the data management discipline within companies that operate at this scale. For example, queries originating from machine learning activities have become a driving force in the utilization of data warehouse resources. In this talk we'll discuss how data management tools are used throughout the data processing cycle: from log capture and data ingestion into the data warehouse, through preprocessing and feature engineering, to model training and serving. The focus will be on positioning each of these steps within a broader recommendation system scenario. We'll conclude with a discussion of the changing landscape of the infrastructure that hosts machine learning and data management workloads.



Title: From parsing, to query processing, to resiliency: Deriving unexpected value from database technology through componentization
Speaker: Jonathan Goldstein, Researcher, Microsoft

Speaker Bio: Over the last 20 years, Jonathan has worked at Microsoft in a combination of research and product roles. In particular, as a researcher at MSR, his research contributions include work in streaming, big data processing, databases, and distributed computing. Componentization is a common theme throughout his work. For instance, his work on stream data processing resulted in a widely used, uniquely flexible, and performant query processor called Trill. Within the academic community, he has published many papers, some with best paper awards (e.g., the Best Paper Award at ICDE 2012) and two with test-of-time awards (the SIGMOD 2011 and ICDT 2018 Test of Time Awards), and has also taken on many organizational roles in database conferences. His research has also had significant impact on many Microsoft products, including SQL Server, Office, Windows, Bing, and Halo, and has led to the creation of new products such as Microsoft StreamInsight, Azure Stream Analytics, and Trill; he spent five years as a founder and architect of Microsoft StreamInsight. Trill has become the de facto standard for temporal and stream data processing within Microsoft and, years after its creation, is still the most versatile, expressive, and performant stream data processor in the world.

Abstract: Over the decades during which databases have evolved, our community has developed an impressive array of technologies that have been incorporated into our database artifacts, from the original relational database servers, like Ingres and System R, to today's rich complement of cloud database services. While these technologies are critical for providing the high value that these services offer, much of this technology has been locked away in monolithic artifacts. In this talk, I will explore our efforts at MSR to offer some of these technologies through new artifacts, which can be seen as a componentization of today's full-featured database products and services. In particular, this talk will focus on three projects in which we have invested heavily, and which componentize high-performance parsing (Mison), query processing (Trill), and resiliency (Ambrosia). The talk will discuss the technology componentized by each of these projects and how these artifacts have been incorporated into other projects that never would have used monolithic databases. We will conclude with lessons learned, from both our successes and our failures.