Hubbry Logo
search
logo
637863

Log-structured merge-tree

logo
Community Hub0 Subscribers
Read side by side
from Wikipedia
Log-structured merge-tree
TypeHybrid (two tree-like components)
Invented1996
Invented byPatrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil
Time complexity in big O notation
Operation Average Worst case
Insert O(T * L / B)
Find-min O(1) O(L * e^(-M/N))
Space complexity
Space O((T + 1) / T)

In computer science, the log-structured merge-tree (also known as LSM tree, or LSMT[1]) is a data structure with performance characteristics that make it attractive for providing indexed access to files with high insert volume, such as transactional log data. LSM trees, like other search trees, maintain key-value pairs. LSM trees maintain data in two or more separate structures, each of which is optimized for its respective underlying storage medium; data is synchronized between the two structures efficiently, in batches.

One simple version of the LSM tree is a two-level LSM tree.[2] As described by Patrick O'Neil, a two-level LSM tree comprises two tree-like structures, called C0 and C1. C0 is smaller and entirely resident in memory, whereas C1 is resident on disk. New records are inserted into the memory-resident C0 component. If the insertion causes the C0 component to exceed a certain size threshold, a contiguous segment of entries is removed from C0 and merged into C1 on disk. The performance characteristics of LSM trees stem from the fact that each component is tuned to the characteristics of its underlying storage medium, and that data is efficiently migrated across media in rolling batches, using an algorithm reminiscent of merge sort. Such tuning involves writing data in a sequential manner as opposed to as a series of separate random access requests. This optimization reduces total seek time in hard-disk drives (HDDs) and latency in solid-state drives (SSDs).

Diagram illustrating compaction of data in a log-structured merge tree
Diagram illustrating compaction of data in a log-structured merge tree

Most LSM trees used in practice employ multiple levels. Level 0 is kept in main memory, and might be represented using a tree. The on-disk data is organized into sorted runs of data. Each run contains data sorted by the index key. A run can be represented on disk as a single file, or alternatively as a collection of files with non-overlapping key ranges. To perform a query on a particular key to get its associated value, one must search in the Level 0 tree and also each run. The Stepped-Merge version of the LSM tree[3] is a variant of the LSM tree that supports multiple levels with multiple tree structures at each level.

A particular key may appear in several runs, and what that means for a query depends on the application. Some applications simply want the newest key-value pair with a given key. Some applications must combine the values in some way to get the proper aggregate value to return. For example, in Apache Cassandra, each value represents a row in a database, and different versions of the row may have different sets of columns.[4]

In order to keep down the cost of queries, the system must avoid a situation where there are too many runs.

Extensions to the 'leveled' method to incorporate B+ tree structures have been suggested, for example bLSM[5] and Diff-Index.[6] LSM-tree was originally designed for write-intensive workloads. As increasingly more read and write workloads co-exist under an LSM-tree storage structure, read data accesses can experience high latency and low throughput due to frequent invalidations of cached data in buffer caches by LSM-tree compaction operations. To re-enable effective buffer caching for fast data accesses, a Log-Structured buffered-Merged tree (LSbM-tree) is proposed and implemented.[7]

Operations

[edit]

Write

[edit]

In LSM-trees, write operations are designed to optimize performance by reducing random I/O and leveraging sequential disk writes. When a write operation is initiated, the data is first buffered in an in-memory component, often implemented using a sorted data structure such as a Skip list or B+ tree.

Once the in-memory buffer becomes full, it is flushed to the disk as an immutable sorted component at the first level (C1). This flush is performed sequentially, ensuring high I/O efficiency by avoiding the costly random writes typical of traditional indexing methods. To maintain durability, the system may use a write-ahead log (WAL) that records all incoming writes before they are added to the memory buffer. This ensures that no data is lost in the event of a crash during a write.

As data accumulates across levels on the disk, LSM trees employ a merge process similar to merge sort to consolidate entries and maintain a consistent sorted order across levels. This process also handles updates and deletes, removing redundant or obsolete entries. Updates are treated as new writes, while deletes are marked with a tombstone entry, which is a placeholder indicating that the key has been deleted. These tombstones are later removed during the merging process.

Two common merging policies govern how data flows through the levels: levelling and tiering. In levelling, only one component exists per level, and merging happens more frequently, reducing the total number of components but increasing write amplification. In tiering, multiple components can coexist within a level, and merging occurs less frequently, reducing write amplification but increasing read costs because more components need to be searched.

This design leads to amortized write costs of for leveling merge policies, where is the size ratio between levels, is the number of levels, and is the number of entries per page.[8]

Point lookup

[edit]

A point lookup operation retrieves the value associated with a specific key. In LSM trees, because of the multi-level structure and the immutability of disk components, point lookups involve checking several levels to ensure the most up-to-date value for the key is returned.

The process begins with a search in the in-memory component (C0), which holds the latest data. Since this component is organized as a sorted structure, the search is efficient. If the key is found here, its value is returned right away.

If the key isn’t found in memory, the search moves to the disk components, starting with the first level (C1) and continuing through deeper levels (C2, C3, etc.). Each disk level is also sorted, allowing for efficient searches using methods like binary search or tree search. Newer levels are checked first because they contain the most recent updates and deletions for the key.

To make the search faster, LSM trees often use a bloom filter for each on-disk component. These filters are probabilistic structures that help quickly determine whether a key is definitely absent from a component. Before accessing any disk component, the system checks the Bloom filter. If the filter indicates the key is not present, the component is skipped, saving time. If the filter suggests the key might be there, the system proceeds to search for the component.

The lookup operation stops as soon as the key is found, ensuring the most recent version is returned. If the search reaches the final level without finding the key, the system concludes that the key doesn’t exist.

Point lookup complexity is without Bloom filters, as each level must be searched. With Bloom filters, the cost for zero-result lookups is significantly reduced to , where M is the total Bloom filter size and N is the number of keys. For existing key lookups, the cost is due to the presence of Bloom filters.[8]

Range query

[edit]

A range query retrieves all key-value pairs within a specified range. The operation involves scanning multiple components across the LSM tree's hierarchical structure and consolidating entries to ensure accurate and complete results.

A range query begins by searching the in-memory component (C0). Since this component is typically a sorted data structure, range queries can be done efficiently. Once the in-memory component is finished, the query proceeds to the disk components, starting from the first level (C1) and continuing to deeper levels.

Each disk component is also stored as a sorted structure allowing efficient range scans within individual components. To perform the range query, the system locates the starting point in each relevant component and scans sequentially until the end of the range is reached. The results from each component are then merged into a priority queue to reconcile duplicates, updates, and deletes, ensuring the final result only includes the latest version of each key.

For efficiency, LSM trees often divide disk components into smaller, disjoint key ranges. In this way, when processing a range query, the system can search only the partitions that have overlap ranges to reduce the number of components accessed. Similar to point lookup, Bloom filters are sometimes used to quickly decide whether a disk component contains any keys within the queried range, allowing the system to skip components that are guaranteed to be irrelevant.

The performance of a range query in LSM-trees depends on the query's selectivity. For short-range queries, which access fewer keys than twice the number of levels, the query must examine components across all levels, leading to a cost proportional to the number of levels. For long-range queries, which access many keys, the price is dominated by scanning the largest level, as it contains most of the data. For this reason, the runtime for short-range queries is and for long-range queries, where is the number of keys in the range.[8]

References

[edit]
[edit]
Revisions and contributorsEdit on WikipediaRead on Wikipedia
from Grokipedia
The Log-structured merge-tree (LSM-tree) is a disk-based indexing data structure designed to enable efficient, low-cost insertions and deletions for files experiencing high rates of update operations, by batching changes in memory and cascading them through a series of sorted, immutable components on disk via periodic merge processes.[1] Introduced in 1996 by Patrick O'Neil, Elizabeth O'Neil, Edward Cheng, and Dieter Gawlick, the LSM-tree addresses the limitations of traditional structures like B-trees, which suffer from expensive random disk I/O during frequent updates, by instead favoring sequential appends to log files and background merging to maintain sorted order.[1] At its core, an LSM-tree organizes data across multiple levels: an in-memory write buffer (often called a memtable) captures incoming writes until it reaches capacity, at which point it is flushed to an immutable on-disk component (typically as sorted string tables or SSTables), and subsequent merges propagate data to deeper, larger levels to resolve overlaps and deletions.[2] This tiered architecture, with each level exponentially larger than the previous, minimizes write amplification while supporting tunable trade-offs between write throughput and read performance through compaction strategies like leveled or tiered merging.[2] The LSM-tree's primary advantages include dramatically reduced I/O costs—up to two orders of magnitude lower than B-trees for insert-heavy workloads—due to its reliance on multi-page block reads and sequential writes, making it particularly suitable for applications like time-series logging, audit trails, and high-volume append-only data.[1] However, it incurs higher read latencies for point queries and scans, as these may require searching multiple components or bloom filters to skip irrelevant data, and compactions can introduce temporary write stalls or space overhead from duplicates before merging.[2] Since its inception, the LSM-tree has become foundational to numerous production storage systems, powering write-optimized key-value stores such as Google's LevelDB (2011), Facebook's RocksDB (2012), and Apache Cassandra (2008), where it enables scalable handling of petabyte-scale datasets with sustained ingestion rates exceeding millions of operations per second.[2] Ongoing research continues to refine LSM-tree variants, incorporating techniques like learned indexes and hybrid memory integration to mitigate amplification issues and adapt to flash-based storage.[3][4]

Overview

Definition

A log-structured merge-tree (LSM-tree) is a disk-based data structure designed to provide low-cost indexing for files with high rates of record inserts and deletes over extended periods.[1] It achieves this by deferring and batching index changes, cascading them from a memory-based component through one or more disk components in a manner similar to merge sort, thereby optimizing for sequential write operations on disk.[1] At its core, the LSM-tree trades increased read latency for superior write efficiency through a tree-like hierarchy of levels, each containing multiple immutable sorted string tables (SSTables) that store key-value pairs in sorted order.[1] This structure appends new writes sequentially to avoid random I/O, with organization deferred to background processes that merge data across levels to maintain stability.[1] The LSM-tree guarantees eventual consistency for reads by propagating updates through merges, though point queries may require scanning and merging data from multiple levels to retrieve the most current value.[5] In its basic architecture, recent writes are buffered in an in-memory memtable; once full, it is flushed to disk as an immutable SSTable in level L0, with subsequent levels (L1 to Ln) holding progressively larger, more stable datasets formed by merging lower-level SSTables.[1] Higher levels are typically 10 times larger than the preceding ones, ensuring efficient scaling for write-heavy workloads.[1]

Motivation

Traditional B-tree indexes exhibit significant write amplification in environments with high random write rates, as each update often triggers page splits, index rebalancing, and multiple random disk I/Os—typically O(log N) operations per write on rotating magnetic media, leading to inefficient performance for insert-heavy workloads.[1] The log-structured merge-tree (LSM-tree) was designed to mitigate these issues through a log-structured approach that favors sequential append-only writes to disk, achieving an amortized write cost of O(1) and minimizing seek times. This strategy is especially advantageous for flash and SSD-based storage, where random writes cause uneven wear on memory cells and reduce device lifespan, enabling LSM-trees to sustain higher sustained write throughputs.[1][6] In exchange for these write optimizations, LSM-trees incur higher read complexity by requiring the merging of data from multiple sorted components on disk, but this trade-off yields write performance 10-100 times faster than B-trees for append-dominated scenarios. LSM-trees target workloads involving massive data ingestion, such as time-series monitoring, event logging, and NoSQL databases handling millions of writes per second.[1][7] A key comparison metric is space amplification, where LSM-trees can exhibit around 1.1x for leveled compaction to 2x or more for tiered compaction compared to the logical data size due to temporary key duplicates arising during the asynchronous merging process, though compaction helps mitigate this over time.[8]

History

Invention

The Log-Structured Merge-Tree (LSM-tree) was proposed in 1996 by Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil as a novel disk-based indexing structure designed to support high rates of insert and delete operations in database systems.[1] This work originated from a 1991 technical report by the same authors at the University of Massachusetts Boston Mathematics and Computer Science Department and was formally published in the journal Acta Informatica.[1] The proposal aimed to overcome the write bottlenecks inherent in traditional B-tree indexes, which require frequent random I/O updates that scale poorly with transaction volumes.[1] The invention was motivated by the need to handle high-volume updates in online transaction processing (OLTP) environments, such as those benchmarked by the TPC-A standard, where tables like the History table could grow rapidly—potentially adding millions of entries per day—without incurring the full cost of immediate index maintenance.[1] Drawing inspiration from log-structured file systems (LFS), introduced by Mendel Rosenblum and John K. Ousterhout in 1992, the LSM-tree adapts the idea of sequential logging to indexed data by deferring and batching changes to amortize I/O costs across multiple operations.[1][9] In LFS, all file system modifications are written sequentially to a log on disk, minimizing seek times and enabling efficient garbage collection; the LSM-tree extends this to key-value indexing by combining write-ahead logging with a multi-level merge process.[1] Key innovations include an in-memory buffer called the C0 component (analogous to a memtable) that accumulates recent writes before flushing them to immutable disk files in levels C1, C2, and beyond, where older data is merged in a rolling fashion to resolve overlaps and deletions.[1] This structure leverages sequential multi-page writes—similar to LFS block sizes—to reduce random I/O, achieving an expected I/O cost per insert of approximately 2 × (sequential page I/O cost) / (pages per block), compared to the higher random I/O demands of B-trees.[1] For space usage, the design anticipates efficient storage for growing datasets; for instance, indexing 576 million entries in a TPC-A-like History table over 20 days requires about 9.2 GB, including redundancy from unmerged levels that is progressively resolved.[1] These bounds highlight the LSM-tree's ability to trade some space for significantly improved write throughput in write-heavy workloads.[1]

Evolution

Following the initial proposal of LSM-trees in the mid-1990s, their practical adoption accelerated in the 2000s through integration into distributed storage systems. Google's Bigtable, released in 2006, was among the first major implementations, employing LSM-tree structures to handle structured data at petabyte scale across commodity hardware clusters.[10] This approach influenced open-source projects like Apache HBase, which modeled its column-family storage on Bigtable's LSM-based design for scalable, random read-write access to large datasets. Similarly, Apache Cassandra, launched in 2008, adopted LSM principles with SSTables for decentralized, high-availability storage, enabling fault-tolerant writes in wide-column databases. Refinements in the late 2000s and early 2010s focused on optimizing compaction and read efficiency. Earlier LSM designs, such as Bigtable's size-tiered compaction, prioritized write throughput but suffered from higher space amplification due to overlapping key ranges across tiers. In 2011, Google's LevelDB introduced leveled compaction, which enforces non-overlapping key ranges per level to cap space usage at approximately 10% amplification while trading off some write costs.[11] Concurrently, Bloom filters emerged as a key read optimization, first practically integrated in Cassandra around 2008-2009 to probabilistically filter absent keys before SSTable scans, reducing unnecessary I/O by up to 90% in point lookups. The 2010s saw broader enhancements for production workloads, particularly on flash storage. Facebook's RocksDB, forked from LevelDB in 2012, added configurable compaction styles (including leveled and universal variants), per-column-family compression (e.g., Snappy or LZ4), and SSD-specific tuning to lower write amplification from 10-20x to under 5x in typical setups by optimizing merge scheduling and buffer management.[12] These features addressed flash endurance limits, enabling LSM-trees in high-velocity applications like social graphs. Recent developments through 2025 have emphasized hardware acceleration and intelligent structures for cloud-native scalability. Integration with NVMe SSDs has reduced tail latencies by exploiting parallel I/O queues, as demonstrated in 2021 designs achieving 2-5x faster throughput over SATA-based LSM stores without sacrificing durability.[13] Hybrid models blending LSM persistence with in-memory caching gained traction, notably in Apache Ignite 3 (2021), which layers RocksDB-backed LSM-trees under a distributed memory grid for unified transactional and analytical processing at terabyte scales.[14] Meanwhile, research from 2020-2024 on learned indexes has targeted merge overhead reduction; for instance, the 2020 BOURBON system uses neural networks to approximate index lookups in LSM levels, improving point lookup performance by 1.23× to 1.78× while maintaining query accuracy.[15] In 2025, further advances include HotRAP for efficient hot data retention and promotion in LSM-trees with under 4% overhead on uniform workloads, and TieredKV, a tiered LSM design integrating learned indexes for superior throughput and reduced amplification.[16][17] These advances have extended LSM-trees' viability to elastic, multi-tenant cloud databases, filling gaps in post-2010 scalability for diverse storage media.

Data structure

Memtable

The memtable serves as the primary in-memory buffer in an LSM-tree, capturing incoming writes to achieve high-throughput ingestion without immediate disk I/O. It is typically realized as a sorted data structure, such as a skip list, red-black tree, or AVL tree, which stores key-value pairs in logarithmic time for insertions and supports efficient range scans.[18] These structures ensure that data remains ordered by key, facilitating subsequent merging with on-disk components.[19] In the write path, updates are appended sequentially to the memtable: insertions add new key-value pairs, updates overwrite existing entries for the same key, and deletions insert a tombstone to mark the key as invalid without immediate removal.[20] This append-only approach avoids costly in-place modifications, converting random writes into efficient sequential operations in memory.[18] The memtable is engineered for thread-safety, often using lock-free or fine-grained locking mechanisms to handle concurrent writes from multiple threads without significant contention.[21] Durability is maintained by first logging each write to a write-ahead log (WAL) on persistent storage before committing it to the memtable, enabling recovery of unpersisted data in the event of a crash by replaying the WAL.[18] Common configurations set the memtable capacity between 64 MB and 256 MB, balancing memory footprint with performance; for instance, RocksDB defaults to 64 MB per memtable.[22] Upon reaching capacity, the active memtable is sealed as immutable, a new mutable memtable is initialized for ongoing writes, and the immutable one is flushed to disk as a level-0 sorted string table (SSTable).[23] This process triggers background serialization without blocking new writes, though frequent flushes can introduce temporary stalls if memory pressure builds multiple immutable memtables.[24] The memtable's size configuration embodies key memory trade-offs: larger allocations decrease flush frequency and mitigate write stalls by amortizing I/O overhead, but they increase RAM consumption and potential recovery time from failures.[24] Optimizing this balance is critical for workloads with varying write rates, as excessive flushing amplifies disk writes while undersized buffers strain available memory.[20]

SSTables and levels

SSTables, or Sorted String Tables, serve as the fundamental on-disk storage units in LSM-tree implementations, consisting of immutable, append-only files that store key-value pairs in sorted order by key.[12] Each SSTable is typically sized between 1 and 64 MB to balance I/O efficiency and manageability, and it is divided into fixed-size data blocks (often around 4 KB) for sequential reads, with the entire file compressed using algorithms such as Snappy or Zstandard to reduce storage overhead.[12] Once written, an SSTable is never modified, ensuring write-once semantics that align with the log-structured nature of the LSM-tree. In the LSM-tree's hierarchical structure, SSTables are organized into multiple levels, denoted as L0 through L_max (commonly up to 7-10 levels), where each level represents a progressively larger and more stable layer of data.[12] Level L0 contains SSTables directly flushed from the in-memory memtable, and these files may overlap in their key ranges, potentially covering the same keys multiple times due to recent writes.[1] Starting from L1 and higher, SSTables are non-overlapping within a level, with each subsequent level designed to hold approximately 10 times the total data size of the previous one, creating an exponential growth that accommodates the full dataset while minimizing redundancy.[1] This tiered arrangement, inspired by the component hierarchy in the original LSM design (C0 for memory, C1+ for disk), ensures that newer data resides in upper levels and older, more persistent data in lower ones.[1] Data placement across levels follows a range-partitioning scheme, where keys are divided into disjoint intervals to cover the entire key space efficiently.[12] In L0, the overlapping nature allows for quick ingestion of new data without immediate reorganization, while L1 and beyond enforce strict non-overlap, with each SSTable responsible for a unique key range that tiles the key space without gaps or significant redundancies.[12] Higher levels thus provide broad coverage of the key space with minimal duplication, facilitating scalable storage as the dataset grows.[1] For efficient querying, each SSTable includes internal indexing structures, such as block indexes that map key prefixes to specific data blocks, enabling fast seeking without scanning the entire file.[12] The file concludes with a footer containing metadata like block offsets, checksums, and pointers to auxiliary structures, while optional Bloom filters—probabilistic data structures—are embedded to quickly determine if a key is absent from the SSTable, avoiding unnecessary I/O for irrelevant files.[25] These mechanisms, building on the sorted file indexes from early LSM designs, support logarithmic-time access by allowing reads to probe only a small number of SSTables per level.[1] The space usage in this leveled organization results in a total amplification factor calculated as the sum of the sizes across all levels divided by the size of unique data, $ S = \sum_{i=0}^{L_{\max}} \frac{\text{size}_i}{\text{unique data}} $, which typically ranges from 2x to 4x depending on update frequency and compaction efficiency.[1] This overhead arises from temporary duplicates during merging but remains low compared to alternatives like B-trees, as exponential sizing ensures that lower levels dominate storage without proportional redundancy.[12]

Merging process

The merging process in a log-structured merge-tree (LSM-tree) operates as an incremental, rolling merge analogous to external merge sort, where multiple sorted runs of key-value pairs are combined across levels, resolving duplicates and tombstones by retaining only the most recent value for each key.[1][26] Level transitions are triggered when the size of level LiL_i exceeds a predefined threshold, typically around 10 times the maximum size of level Li1L_{i-1}, prompting a minor compaction that selects and merges the smallest SSTables from LiL_i with overlapping SSTables in Li+1L_{i+1} to propagate data downward.[1][26] In level L0, key-range overlaps are common and tolerated across up to hundreds of SSTables, with read operations resolving them by selecting the latest value; in contrast, higher levels under leveled compaction maintain disjoint, non-overlapping key ranges that are fully sorted to avoid such redundancy.[26] This multi-level propagation results in an amortized I/O write cost per key of O(logN/B)O(\log N / B), where NN is the total number of keys and BB is the block size, as each key is rewritten once per level during merges.[26] LSM-tree implementations vary in merging strategy, with tiered compaction permitting multiple overlapping runs per level to favor write throughput and leveled compaction enforcing a single run per level for better space efficiency and read performance (detailed in implementations); the process executes asynchronously via background threads to ensure it does not block foreground write operations.[1][26]

Operations

Write operation

In an LSM-tree, the write operation for inserting or updating a key-value pair begins by appending the entry to a write-ahead log (WAL) on disk to ensure durability in case of a crash.[1] This step guarantees that the data can be recovered and replayed into the in-memory structure if needed, providing fault tolerance without immediate random I/O overhead.[1] Next, the entry is inserted into the memtable, an in-memory balanced tree structure such as a skip list or red-black tree, which supports efficient lookups and updates.[6] The insertion time complexity is O(log M), where M is the size of the memtable, enabling low-latency operations typically under 1 ms. Updates to existing keys are handled by overwriting the prior value in the memtable, while deletes insert a special deletion marker known as a tombstone, which logically removes the key without immediate physical erasure. When the memtable reaches a configured size threshold, it is frozen—made immutable to new writes—and a new active memtable is created to continue accepting incoming operations, ensuring concurrency without blocking.[27] The frozen memtable is then flushed sequentially to disk as a new immutable sorted string table (SSTable) at level L0, amortizing the cost over batched writes and avoiding random I/O.[1] This sequential flushing contributes to high write throughput by leveraging the storage device's strengths in sequential access.[6] The write amplification (WA), defined as the ratio of total bytes written to disk versus user data bytes, arises from repeated copying during subsequent merges and is approximated by
WAlogLN, \text{WA} \approx \log_L N,
where L is the fan-out factor per level (typically 10) and N is the total number of keys in the tree.[6] This metric highlights the efficiency of the append-only write path, though it increases with tree size. Subsequent compaction of these L0 SSTables into higher levels manages overlap but is handled separately.[28]

Read operation

In LSM-trees, point lookups for a specific key begin by searching the in-memory memtable, typically implemented as a balanced tree structure such as an AVL or red-black tree, which provides O(log M) time complexity where M is the number of entries in the memtable.[1] If the key is not found, the search proceeds to the Level 0 (L0) SSTables on disk, which may require scanning multiple files in configurations like RocksDB, though Bloom filters are used to skip irrelevant files with over 90% efficiency by probabilistically determining if a key is absent.[29][30] For higher levels (L1 and beyond), the process involves a binary search across the sorted list of SSTables to identify candidate files—O(log N) where N is the total number of files—followed by a binary search within each file's index, incurring O(log B) time per file where B is the block size.[30] Range queries operate similarly but extend to scanning contiguous key ranges across relevant components; iterators are created for the memtable and each qualifying SSTable, then merged in key order to produce a sorted result set, ensuring all overlapping files from memtable to the highest level are examined.[1] This merging leverages the sorted nature of SSTables, with optimizations like cached block indexes in memory to minimize disk seeks by mapping keys directly to data blocks.[31] Key optimizations further enhance read efficiency: Bloom filters, built per SSTable during creation, enable skipping non-matching files for point lookups, while in-memory caching of file indexes and data blocks reduces repeated I/O for hot keys.[30] The worst-case latency for a point lookup involves O(L × log B) I/O operations, where L is the number of levels (typically around 7 in practical systems like RocksDB), but with Bloom filters and caching, average latencies are significantly reduced.[30][31] For consistency, reads retrieve the latest value by prioritizing the memtable and descending through levels, selecting the most recent version encountered due to the append-only, immutable nature of components, which eliminates the need for read locks.[1]

Compaction

Compaction in log-structured merge-trees (LSM-trees) serves to reorganize on-disk data structures, primarily by merging sorted string tables (SSTables) across levels to reduce key overlaps, reclaim storage space occupied by tombstones (markers for deleted keys), and prevent excessive growth or bloat in individual levels that could degrade read performance and increase space amplification.[1][26] This background maintenance process ensures the tree maintains its efficiency for write-heavy workloads by resolving conflicts from multiple versions of the same key and eliminating obsolete entries that accumulate due to deferred updates.[1] Two primary compaction strategies are commonly employed: leveled and size-tiered. In the leveled strategy, SSTables are merged sequentially from one level to the next, with each level limited to a target size typically 10 times larger than the previous one; this approach aggressively reduces overlaps by ensuring non-overlapping key ranges within a level, but it incurs higher write amplification, often around 10× the ingested data size due to repeated rewriting of data across levels.[29][26] Conversely, the size-tiered strategy groups SSTables by similar sizes within a level before compacting them into larger files in the next level, which allows multiple overlapping files per key range and results in lower write amplification, approximately 4×, at the expense of increased read amplification from more frequent key lookups across files.[32][26] The compaction process operates in background threads to avoid interfering with foreground operations, selecting candidate SSTables based on a score metric that prioritizes levels or files with the highest saturation, such as score = total file size / target level size.[29] Overlapping files are then read, their keys merged in sorted order—resolving duplicates, applying tombstones to remove deleted entries, and discarding superseded values—and the resulting sorted data is written as new, larger SSTables to a higher level, with old files marked for deletion once the new ones are durable.[1][33] Compaction is computationally intensive due to the key merging step but primarily I/O-bound from reading and writing large volumes of data, with configurable heuristics like maximum background threads, minimum file sizes for inclusion, and score thresholds to balance resource usage against workload demands.[33][26] Triggers are based on compaction debt, defined as the sum of file sizes exceeding level thresholds across the tree; compaction initiates when this debt surpasses a system-defined limit to maintain space and performance invariants.[26]
D=(file sizes exceeding thresholds) D = \sum (\text{file sizes exceeding thresholds})
Compaction proceeds when $ D > \theta $, where $ \theta $ is the threshold parameter.[26]

Performance characteristics

Advantages

LSM-trees excel in write optimization by buffering updates in memory before flushing them to disk in sequential batches, which minimizes random I/O operations and enables high throughput on both hard disk drives (HDDs) and solid-state drives (SSDs).[1][34] This append-only approach contrasts with B-trees, which require in-place updates leading to frequent random writes, allowing LSM-trees to achieve insert costs approximately 50 times lower than B-trees under high-update workloads.[1] Additionally, the in-memory memtable supports low tail latency for write bursts, as operations complete quickly without immediate disk involvement.[34] In distributed systems, LSM-trees facilitate horizontal scalability through sharding, where data is partitioned across multiple nodes to handle petabyte-scale datasets without performance degradation. This design supports elastic scaling by adding nodes as data volume grows, making it ideal for large-scale key-value stores like those in NoSQL databases. LSM-trees are well-suited to modern hardware, particularly flash-based SSDs, where their typical write amplification of 10-20x is significantly lower than B-trees' 100x or more, reducing wear and extending device lifespan.[28] Compaction processes, which merge sorted files, can be parallelized across multi-core processors to improve efficiency without blocking foreground writes.[35] Fault tolerance is enhanced by the write-ahead log (WAL), which ensures durability by persisting updates before acknowledgment, allowing recovery of the memtable upon crashes.[1] The immutability of SSTables simplifies crash recovery, as files remain consistent and can be replayed from the WAL without complex in-place repairs.[1] Benchmarks, such as those using the YCSB workload, demonstrate LSM-trees achieving 5-50x faster write throughput compared to B+ trees in update-intensive scenarios, underscoring their suitability for write-heavy applications. Space efficiency is further improved during compaction, where compression techniques and duplicate removal can reduce storage overhead by up to 50% in leveled implementations.[36]

Disadvantages

While LSM-trees excel in write-heavy workloads, they incur notable penalties in read performance, particularly for point queries. Point lookups often require scanning multiple levels of sorted string tables (SSTables), leading to higher read amplification compared to B-trees, which typically access data in logarithmic time with fewer I/O operations. In benchmarks, LSM-tree-based systems like RocksDB exhibit read throughput that is 1.5x to 3x lower than B-tree implementations for random point queries, translating to latencies that are proportionally higher due to the need to merge results from overlapping levels.[37] Range scans, while more efficient than in B-trees for sequential access, still suffer from merge-heavy operations across levels, exacerbating latency in multi-level structures.[28] Space overhead represents another key limitation, stemming from data duplication during the asynchronous merging process. LSM-trees can experience space amplification of typically around 1.2x to 2x the logical data size, depending on configuration and compaction policies, primarily due to temporary storage of outdated versions and duplicates in intermediate SSTables before compaction resolves them.[31] Deletes and updates introduce tombstones—markers that propagate through levels until compacted—further inflating space usage, as these entries persist and consume resources even after the targeted data is logically removed. This lag in cleanup can lead to sustained overhead, especially in workloads with high update rates, where tombstones can significantly inflate space usage until compaction removes them.[26] Compaction, essential for maintaining performance, introduces maintenance overheads that can disrupt operations. Background compaction processes cause I/O spikes known as write stalls, where incoming writes are throttled if levels become imbalanced, potentially increasing tail latencies by orders of magnitude during peak activity.[7] Tuning parameters like size ratios and compaction policies is required to mitigate these "storms," but suboptimal configurations can interfere with foreground reads and writes, reducing overall throughput.[38]

Implementations

Key databases

Apache Cassandra, first released in 2008, is an open-source distributed NoSQL wide-column store that employs LSM-trees as its core storage engine to handle high-write-throughput workloads across clusters. It utilizes size-tiered compaction strategies to manage SSTables, enabling efficient data partitioning and replication in decentralized environments.[39][40] Apache HBase, initially released in 2007, is a distributed, scalable, big data store modeled after Google's Bigtable and built on Hadoop's HDFS, incorporating LSM-tree principles for storing sparse data tables in a master-slave architecture. It supports tiered and date-tiered compaction policies within its region-based storage model to optimize for column-family oriented data access.[41][42][43] LevelDB, developed by Google and released in 2011, is an embeddable persistent key-value store that implements a leveled LSM-tree structure, popularizing leveled compaction where each level holds approximately ten times more data than the previous one. It focuses on simple operations like puts, gets, and scans, making it suitable for lightweight, high-performance storage needs.[11][44] RocksDB, a 2012 fork of LevelDB developed by Facebook, serves as an optimized LSM-tree-based key-value store tailored for fast storage media like SSDs, featuring advanced features such as dynamic level sizing, multiple compaction styles, and transaction support. It powers numerous production systems, including Facebook's infrastructure handling petabyte-scale data across over 30 applications.[45][46] Other notable implementations include ScyllaDB, a high-performance Cassandra-compatible database rewritten in C++ for shard-per-core architecture and low-latency LSM-tree operations, first released in 2015. TiKV, a distributed transactional key-value store written in Rust as part of the TiDB ecosystem, leverages RocksDB's LSM-tree engine for consistent, scalable storage and with its first general availability release in 2017. LSM-tree-based systems like these power thousands of production deployments worldwide, supporting diverse applications from social media to big data analytics.[47][48][6][49][50]

Variants

LSM-trees have evolved through various compaction strategies to balance write and read performance. In size-tiered compaction, also known as tiering, immutable sorted runs of similar sizes are grouped within each level before merging to the next, which reduces write amplification by allowing multiple runs to accumulate per level but increases read amplification due to potential overlaps across multiple runs during lookups.[51] Conversely, leveled compaction spreads data across non-overlapping ranges in each level, compacting runs from the previous level into a single sorted run per level, thereby minimizing read amplification and space usage at the expense of higher write amplification from more frequent full-level merges.[51] These strategies represent fundamental variants, with tiered favoring write-intensive workloads and leveled suiting read-heavy scenarios.[51] The Bw-tree, introduced by Microsoft Research in 2013, extends the LSM paradigm by layering a latch-free B-tree structure over log-structured storage, using delta updates to enable in-place modifications on pages while maintaining sequential logging for writes. This hybrid design achieves LSM-like write throughput through elastic, discontiguous pages and mapping tables, while providing B-tree efficiency for range queries and point lookups on modern hardware like multicore processors and flash storage. In the 2020s, learned LSM variants integrate machine learning models to optimize index structures and reduce scan overhead during reads. For instance, BOURBON (2020) employs learned indexes on Bloom filters and range partitioning within each LSM run, predicting key locations to skip unnecessary scans and improving lookup latency by 1.23× to 1.78× compared to state-of-the-art LSM implementations like WiscKey.[15] These approaches leverage neural networks trained on key distributions to approximate sorted arrays, adapting dynamically to data patterns for better space efficiency and query speed in write-optimized settings. More recent variants, such as HotRAP (2025) for handling hot data in tiered LSM-trees and EcoTune (2025) for optimized compaction policies, continue to address performance trade-offs in modern workloads.[16][52] Hybrid approaches address specialized storage constraints, such as zoned namespaces in SSDs and shingled magnetic recording (SMR) drives. Zoned LSM designs, emerging around 2022, adapt compaction to sequential-write zones by aligning merges with zone boundaries, minimizing random I/O and extending device lifespan; for example, lifetime-leveling compaction predicts run lifetimes to distribute writes evenly across zones, reducing space amplification by approximately 30% and achieving up to 1.7× better performance compared to traditional LSM designs on ZNS SSDs.[53] In-memory augmented variants like PebbleDB enhance LSM with optimized memtables and concurrent compaction, prioritizing stability and throughput for embedded use in distributed systems while maintaining RocksDB compatibility.[54]

References

User Avatar
No comments yet.