I have fallen a bit behind on my 100 days to offload pace. Certainly not from lack of interesting discoveries or happenings, rather too many overlapping my time to write. I’m hoping to get back on track with this post about Delta Lake, databricks open-sourced data lakehouse implementation. This discussion surrounds the paper Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. I highly recommend the read, in doing so I gained deeper insight into the direction of data storage including specific industry use-cases and the relationships between storage solutions thereof.
The problem this work addresses is that traditional object stores offer very little beyond simple key-value queries. This has spurred widespread integration with columnar file formats (ex. Parquet) which offer more acceptable performance than simple scans. However, more complex operations still suffer. Specifically, as data integrations scale, single entities begin to span multiple files and multi-object operations become essential. The lack of atomic operations in an object store environment means that intermediate readers may see partial updates or worse, a system crash can leave the data in a corrupted state.
Delta Lake proposes a data lakehouse paradigm as an elegant blending of cloud-based storage solutions. From an operational standpoint, this claims to be a viable replacement (in many cases) for a data lake (ex. s3) for archival storage, data warehouse (ex. Redshift) for complex queries requiring indexes, and message queues (ex. Apache Kafka) for real-time computations.
At a high level, the architecture is structured as a DBMS over object stores. Individual tables are represented by a collection of files within a directory hierarchy. These include a write-ahead log, which contains a sequence of actions such as metadata changes (ex. schema, etc), data additions or removals, and protocol updates specific for that table. The write-ahead log is periodically compressed, with optimistic concurrency, and stored into table checkpoints, which are columnar oriented files (ie. Parquet) offering more efficient data retrievals.
Without repeating the entire paper (again, I will reiterate that it is a great read), the architectural choices ensure ACID transactions over object stores, effectively blending many of the advantages of both data lakes and data warehouses. Higher-level features include:
- Time Travel and Rollbacks: The use of immutable objects make querying past data seamless.
- Efficient UPSERT, DELETE, and MERGE: ACID transactions mean modifying multiple tables may be performed atomically.
- Stream Ingest and Consumption: The write-ahead table logs may serve as a message queue.
- Data Layout Optimization: Object size has a significant impact on read performance - want table checkpoints to be large, but not too large. Additionally, support for z-order curves allow sequential reads over similar data.
- Improving Cache Performance: Immutable data means that table checkpoints can be stored in memory in entirety.
- Many more …
Simply, I think this is really great work. It seems a natural extension of object stores similar to solutions to improve queryability over distributed file systems. For example, Apache Hive providing SQL-like semantics over HDFS. Conceptually, my work on NahFS purports comparable support for queryability of large data by spatiotemporal attributes. This work obviously goes far beyond mine, offering a more generalized approach which is more powerful inter-domain.
The main strength is a “have your cake and eat it too” approach. Data lakes provide efficient archival storage for large reads and data warehouses offer performance over complex queries. This work blends the architectures, providing ACID transactions over object stores with relatively low performance overhead.
Identifying weaknesses requires pretty careful observation. Most prominent for me is a lack of performance benchmarks for specific use-cases. During my academic tenure, I unfortunately learned to analyze papers in the information they do not include as much as that they do. I would have loved to see deeper insight into (1) performance and applicability into streaming workloads, it seems the support for this is fairly loose and (2) more information on the consistency timeliness. The trade-off space in this approach is consistency and queryability. The system is eventually consistent, which eases support for ACID transactions, but means that data is not always available right after insertion. It would be useful to understand metrics surrounding the consistency guarantees.
7 day(s) offloaded in the 100DaysToOffload challenge.