Third International Workshop on Composable Data Management Systems, 2025

Workshop Venue: London, United Kingdom - Co-located with VLDB 2025

Workshop Date: 5th September 2025


Keynote Talks


Title: A Historical Perspective on Extensible and Composable Data Systems

Speaker: C. Mohan, Distinguished Professor of Science (Hong Kong Baptist University, China), Distinguished Visiting Professor (Tsinghua University, China), Retired IBM Fellow (IBM Research, USA)

Speaker Bio: Dr. C. Mohan is currently a Distinguished Professor of Science at Hong Kong Baptist University, a Distinguished Visiting Professor at Tsinghua University in China, and a member of the inaugural Board of Governors of Digital University Kerala. He retired in June 2020 from being an IBM Fellow at the IBM Almaden Research Center in Silicon Valley. He was an IBM researcher for 38.5 years in the database, blockchain, AI and related areas, impacting numerous IBM and non-IBM products, the research and academic communities, and standards, especially with his invention of the well-known ARIES family of database locking and recovery algorithms, and the Presumed Abort distributed commit protocol. This IBM (1997-2020), ACM (2002-) and IEEE (2003-) Fellow has also served as the IBM India Chief Scientist (2006-2009). In addition to receiving the ACM SIGMOD Edgar F. Codd Innovations Award (1996), the VLDB 10 Year Best Paper Award (1999) and numerous IBM awards, Mohan was elected to the United States and Indian National Academies of Engineering (2009), and named an IBM Master Inventor (1997). This Distinguished Alumnus of IIT Madras (1977) received his PhD from the University of Texas at Austin (1981). He is an inventor of 50 patents. In recent years, he has focused on Blockchain, AI, Big Data and Cloud technologies (https://bit.ly/sigBcP, https://bit.ly/CMoTalks). During 1H2021, Mohan was the Shaw Visiting Professor at the National University of Singapore. Since 2016, Mohan has been a Distinguished Visiting Professor at China’s prestigious Tsinghua University. In 2023, he was named Distinguished Professor of Science at Hong Kong Baptist University. In 2021, he was inducted as a member of the inaugural Board of Governors of the new Indian university Digital University Kerala. Mohan has served on the advisory board of IEEE Spectrum, and on numerous conference and journal boards.
During most of 2022, he was a consultant at Google with the title of Visiting Researcher. He was also a consultant to the Microsoft Data Team in 2020. Mohan is a frequent speaker in North America, Europe and Asia. He has given talks in 43 countries. He is highly active on social media and has a large network of followers. More information can be found on his Wikipedia page at https://bit.ly/CMwIkP and on his homepage at https://bit.ly/CMoDUK.

Abstract: Compared to the architecture of IBM San Jose Research Laboratory’s System R, in the context of which the SQL language was first developed, the architecture of database management systems (DBMSs) has evolved considerably. As new requirements for the external and internal functionality of DBMSs arose, the initial System R architecture was found to be quite rigid and to require significant work to accommodate them. That led to the emergence of several DBMS research projects focused on extensible and composable system architectures. Once open-source software and software-as-a-service (SaaS) concepts became commonplace, they provided additional impetus to the evolution of DBMS and other middleware architectures. In certain use-case scenarios, such architectures resulted in performance degradation and/or worse price-performance. In such cases, a more tightly integrated hardware-software co-design was found to be better. In this talk, I will trace these historical developments and discuss the features of a few state-of-the-art systems.


Title: Speedrunning a lakehouse: a composable FaaS over object storage

Speaker: Jacopo Tagliabue, CTO, Bauplan Labs

Speaker Bio: Jacopo Tagliabue is the co-founder and CTO of Bauplan. Educated in several acronyms across the globe (UNISR, SFI, MIT), Jacopo was co-founder and CTO of Tooso, an AI startup acquired by TSX:CVO in 2019. He led Coveo's AI from scale-up to IPO, and built out Coveo Labs, a prolific R&D practice whose libraries, models and datasets have garnered tens of millions of downloads.
Throughout his career, he has been fortunate enough to collaborate with incredible folks in industry and academia (e.g. Netflix, NVIDIA, Stanford, Univ. of Wisconsin-Madison), and publish contributions in a variety of fields: Information Retrieval (RecSys, SIGIR), Data Science (KDD), Artificial Intelligence and NLP (ICML, NAACL), Data Management (SIGMOD, VLDB), Computer Systems (Middleware). While building his new company, he is teaching ML Systems at NYU, which is mostly notable because it is the only job he ever had that his parents understand.

Abstract: The lakehouse is emerging as the foundational design for data and AI workloads. Heterogeneous runtimes, requirements, and integrations are well suited to a composable stack, but flexibility may come at the expense of simplicity: how can we reason about such complex systems, both as final users and as system developers? In this talk, we discuss the opportunities and challenges of building a "Function-as-a-Service" (FaaS) lakehouse: every workload is served by one or more functions, which unifies the mental model for users (e.g. pipelines are “chained functions”) as well as for developers.
Dissatisfied with existing models, we built a composable system on top of object storage, which required innovation across the entire stack: from function scheduling to containerization, from “Git for Data” to differential Arrow caching. Today, companies run tens of thousands of ephemeral functions as part of Bauplan pipelines: we conclude by discussing the importance of API design for viral enterprise adoption, and why composable user abstractions should hide the composability in the stack.


Title: Theseus, a Composable Distributed Execution Runtime: Performance across GPUs, Networks, and Storage

Speaker: Felipe Aramburu, Distinguished Architect, Co-Founder at Voltron Data

Speaker Bio: Felipe has been working on accelerated SQL engines and CUDA computational primitives for over a decade. At Voltron Data, he is the Architect of Theseus, a distributed SQL engine that runs on AMD and NVIDIA GPUs. His main focus is optimizing dataflow architectures in the context of distributed ETL, AI, and ML.

Abstract: In this talk, we will discuss how Theseus, a composable distributed execution framework, leverages composability to be scalable and performant across multiple compute accelerators and highly varied hardware configurations. We will start by looking at two different systems. The first is an on-prem deployment with fast networking, high-performance centralized storage, and larger GPUs. The second uses commodity hardware available in the cloud, where networking is slower, data is read from slower object stores, and the GPUs are smaller.
We will then discuss how composability allows us to map our software to the hardware we are working with. First, we will discuss two of our communication protocols, one leveraging UCX and the other leveraging TCP with Boost.Asio, and how they perform on our two respective systems. After this, we will discuss two different abstractions for reading bytes from input files. The first leverages GPUDirect Storage with Weka in an on-prem deployment, and the second is used to read files that are stored in S3. Lastly, we will discuss using composability to target multiple hardware accelerators, showing a backend-agnostic task execution framework that allows you to create tasks that target multiple backends and memory spaces in a single process.


Title: What PostgreSQL Extensibility Can Teach Us About Composable Data Management Systems

Speaker: Abigale Kim, Ph.D. Student, University of Wisconsin–Madison

Speaker Bio: Abigale Kim is a PhD student with Prof. Xiangyao Yu at the University of Wisconsin–Madison, where she studies GPU-accelerated database systems. Previously, she was a Master’s student with Prof. Andrew Pavlo at Carnegie Mellon University, where she researched database system extensibility. She has also worked on various database systems in industry (TileDB, YugabyteDB, AWS Redshift). She completed her undergraduate degree at Carnegie Mellon University with a concentration in computer systems.

Abstract: Many well-known database systems allow users to write extensions: custom code that adds new features to a DBMS while maintaining its core functionality and infrastructure. Our analysis of database system extensibility reveals that PostgreSQL has the most flexible extension API. PostgreSQL’s extension API allows users to override much of the core DBMS’s functionality, including the entire query execution layer. Some extensions (e.g., Citus, Timescale, Apache AGE) are essentially distinct systems that use the PostgreSQL codebase structure as a skeleton. Analyzing these extensions, and the PostgreSQL APIs that allow them to exist, yields many valuable insights about composability. In this talk, we will cover the lessons that PostgreSQL’s extension API offers for database system composability.