Recent from talks
Nothing was collected or created yet.
Rope (data structure)
View on Wikipedia
In computer programming, a rope, or cord, is a data structure composed of smaller strings that is used to efficiently store and manipulate longer strings or entire texts. For example, a text editing program may use a rope to represent the text being edited, so that operations such as insertion, deletion, and random access can be done efficiently.[1]
Description
[edit]A rope is a type of binary tree where each leaf (end node) holds a string of manageable size and length (also known as a weight), and each node further up the tree holds the sum of the lengths of all the leaves in its left subtree. A node with two children thus divides the whole string into two parts: the left subtree stores the first part of the string, the right subtree stores the second part of the string, and a node's weight is the length of the first part.
For rope operations, the strings stored in nodes are assumed to be constant immutable objects in the typical nondestructive case, allowing for some copy-on-write behavior. Leaf nodes are usually implemented as basic fixed-length strings with a reference count attached for deallocation when no longer needed, although other garbage collection methods can be used as well.
Operations
[edit]In the following definitions, N is the length of the rope, that is, the weight of the root node. These examples are defined in the Java programming language.
Collect leaves
[edit]- Definition: Create a stack S and a list L. Traverse down the left-most spine of the tree until reaching a leaf l', adding each node n to S. Add l' to L. The parent of l' (p) is at the top of the stack. Repeat the procedure for p's right subtree.
package org.wikipedia.example;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import jakarta.annotation.NonNull;
class RopeLike {
private RopeLike left;
private RopeLike right;
public RopeLike(RopeLike left, RopeLike right) {
this.left = left;
this.right = right;
}
public RopeLike getLeft() {
return left;
}
public RopeLike getRight() {
return right;
}
}
public final class InOrderRopeIterator implements Iterator<RopeLike> {
private final Deque<RopeLike> stack;
public InOrderRopeIterator(@NonNull RopeLike root) {
stack = new ArrayDeque<>();
RopeLike c = root;
while (c != null) {
stack.push(c);
c = c.getLeft();
}
}
@Override
public boolean hasNext() {
return stack.size() > 0;
}
@Override
public RopeLike next() {
RopeLike result = stack.pop();
if (!stack.isEmpty()) {
RopeLike parent = stack.pop();
RopeLike right = parent.getRight();
if (right != null) {
stack.push(right);
RopeLike cleft = right.getLeft();
while (cleft != null) {
stack.push(cleft);
cleft = cleft.getLeft();
}
}
}
return result;
}
}
Rebalance
[edit]- Definition: Collect the set of leaves L and rebuild the tree from the bottom-up.
import java.util.List;
static boolean isBalanced(RopeLike r) {
int depth = r.depth();
if (depth >= FIBONACCI_SEQUENCE.length - 2) {
return false;
}
return FIBONACCI_SEQUENCE[depth + 2] <= r.weight();
}
static RopeLike rebalance(RopeLike r) {
if (!isBalanced(r)) {
List<RopeLike> leaves = Ropes.collectLeaves(r);
return merge(leaves, 0, leaves.size());
}
return r;
}
static RopeLike merge(List<RopeLike> leaves) {
return merge(leaves, 0, leaves.size());
}
static RopeLike merge(List<RopeLike> leaves, int start, int end) {
int range = end - start;
switch (range) {
case 1:
return leaves.get(start);
case 2:
return return new RopeLikeTree(leaves.get(start), leaves.get(start + 1));
default:
int mid = start + (range / 2);
return new RopeLikeTree(merge(leaves, start, mid), merge(leaves, mid, end));
}
}
Insert
[edit]- Definition:
Insert(i, S’): insert the string S’ beginning at position i in the string s, to form a new string C1, ..., Ci, S', Ci + 1, ..., Cm. - Time complexity: .
This operation can be done by a Split() and two Concat() operations. The cost is the sum of the three.
import javafx.util.Pair;
public Rope insert(int idx, CharSequence sequence) {
if (idx == 0) {
return prepend(sequence);
} else if (idx == length()) {
return append(sequence);
} else {
Pair<RopeLike, RopeLike> lhs = base.split(idx);
return new Rope(Ropes.concat(lhs.getKey().append(sequence), lhs.getValue()));
}
}
Index
[edit]
- Definition:
Index(i): return the character at position i - Time complexity:
To retrieve the i-th character, we begin a recursive search from the root node:
@Override
public int indexOf(char ch, int startIndex) {
if (startIndex > weight) {
return right.indexOf(ch, startIndex - weight);
} else {
return left.indexOf(ch, startIndex);
}
}
For example, to find the character at i=10 in Figure 2.1 shown on the right, start at the root node (A), find that 22 is greater than 10 and there is a left child, so go to the left child (B). 9 is less than 10, so subtract 9 from 10 (leaving i=1) and go to the right child (D). Then because 6 is greater than 1 and there's a left child, go to the left child (G). 2 is greater than 1 and there's a left child, so go to the left child again (J). Finally 2 is greater than 1 but there is no left child, so the character at index 1 of the short string "na" (ie "n") is the answer. (1-based index)
Concat
[edit]
- Definition:
Concat(S1, S2): concatenate two ropes, S1 and S2, into a single rope. - Time complexity: (or time to compute the root weight)
A concatenation can be performed simply by creating a new root node with left = S1 and right = S2, which is constant time. The weight of the parent node is set to the length of the left child S1, which would take time, if the tree is balanced.
As most rope operations require balanced trees, the tree may need to be re-balanced after concatenation.
Split
[edit]
- Definition:
Split (i, S): split the string S into two new strings S1 and S2, S1 = C1, ..., Ci and S2 = Ci + 1, ..., Cm. - Time complexity:
There are two cases that must be dealt with:
- The split point is at the end of a string (i.e. after the last character of a leaf node)
- The split point is in the middle of a string.
The second case reduces to the first by splitting the string at the split point to create two new leaf nodes, then creating a new node that is the parent of the two component strings.
For example, to split the 22-character rope pictured in Figure 2.3 into two equal component ropes of length 11, query the 12th character to locate the node K at the bottom level. Remove the link between K and G. Go to the parent of G and subtract the weight of K from the weight of D. Travel up the tree and remove any right links to subtrees covering characters past position 11, subtracting the weight of K from their parent nodes (only node D and A, in this case). Finally, build up the newly orphaned nodes K and H by concatenating them together and creating a new parent P with weight equal to the length of the left node K.
As most rope operations require balanced trees, the tree may need to be re-balanced after splitting.
import javafx.util.Pair;
public Pair<RopeLike, RopeLike> split(int index) {
if (index < weight) {
Pair<RopeLike, RopeLike> split = left.split(index);
return Pair.of(rebalance(split.getKey()), rebalance(new RopeLikeTree(split.getValue(), right)));
} else if (index > weight) {
Pair<RopeLike, RopeLike> split = right.split(index - weight);
return Pair.of(rebalance(new RopeLikeTree(left, split.getKey())), rebalance(split.getValue()));
} else {
return Pair.of(left, right);
}
}
Remove
[edit]- Definition:
Remove(i, j): remove the substring Ci, …, Ci + j − 1, from s to form a new string C1, …, Ci − 1, Ci + j, …, Cm. - Time complexity: .
This operation can be done by two Split() and one Concat() operation. First, split the rope in three, divided by i-th and i+j-th character respectively, which extracts the string to remove in a separate node. Then concatenate the other two nodes.
import javafx.util.Pair;
@Override
public RopeLike remove(int start, int length) {
Pair<RopeLike, RopeLike> lhs = split(start);
Pair<RopeLike, RopeLike> rhs = split(start + length);
return rebalance(new RopeLikeTree(lhs.getKey(), rhs.getValue()));
}
Report
[edit]- Definition:
Report(i, j): output the string Ci, …, Ci + j − 1. - Time complexity:
To report the string Ci, …, Ci + j − 1, find the node u that contains Ci and weight(u) >= j, and then traverse T starting at node u. Output Ci, …, Ci + j − 1 by doing an in-order traversal of T starting at node u.
Comparison with monolithic arrays
[edit]| Operation | Rope | String |
|---|---|---|
| Index[1] | O(log n) | O(1) |
| Split[1] | O(log n) | O(1) |
| Concatenate | O(1) amortized, O(log n) worst case[citation needed] | O(n) |
| Iterate over each character[1] | O(n) | O(n) |
| Insert[2][failed verification] | O(log n) | O(n) |
| Append[2][failed verification] | O(1) amortized, O(log n) worst case | O(1) amortized, O(n) worst case |
| Remove | O(log n) | O(n) |
| Report | O(j + log n) | O(j) |
| Build | O(n) | O(n) |
Advantages:
- Ropes enable much faster insertion and deletion of text than monolithic string arrays, on which operations have time complexity O(n).
- Ropes do not require O(n) extra memory when operated upon (arrays need that for copying operations).
- Ropes do not require large contiguous memory spaces.
- If only nondestructive versions of operations are used, rope is a persistent data structure. For the text editing program example, this leads to an easy support for multiple undo levels.
Disadvantages:
- Greater overall space use when not being operated on, mainly to store parent nodes. There is a trade-off between how much of the total memory is such overhead and how long pieces of data are being processed as strings. The strings in example figures above are unrealistically short for modern architectures. The overhead is always O(n), but the constant can be made arbitrarily small.
- Increase in time to manage the extra storage
- Increased complexity of source code; greater risk of bugs
This table compares the algorithmic traits of string and rope implementations, not their raw speed. Array-based strings have smaller overhead, so (for example) concatenation and split operations are faster on small datasets. However, when array-based strings are used for longer strings, time complexity and memory use for inserting and deleting characters becomes unacceptably large. In contrast, a rope data structure has stable performance regardless of data size. Further, the space complexity for ropes and arrays are both O(n). In summary, ropes are preferable when the data is large and modified often.
See also
[edit]- The Cedar programming environment, which used ropes "almost since its inception"[1]
- The Model T enfilade, a similar data structure from the early 1970s
- Gap buffer, a data structure commonly used in text editors that allows efficient insertion and deletion operations clustered near the same location
- Piece table, another data structure commonly used in text editors
References
[edit]This article relies largely or entirely on a single source. (September 2011) |
- ^ a b c d e Boehm, Hans-J; Atkinson, Russ; Plass, Michael (December 1995). "Ropes: an Alternative to Strings" (PDF). Software: Practice and Experience. 25 (12). New York, NY, USA: John Wiley & Sons, Inc.: 1315–1330. doi:10.1002/spe.4380251203. Archived from the original on 2020-03-08.
- ^ a b "Rope Implementation Overview". www.sgi.com. Archived from the original on 2017-12-19. Retrieved 2017-03-01.
External links
[edit]- "absl::Cord" implementation of ropes within The Abseil library
- "C cords" implementation of ropes within the Boehm Garbage Collector library
- SGI C++ specification for ropes (supported by STLPort and libstdc++)
- Ropes for C#
- ropes for Common Lisp
- Ropes for Java
- String-Like Ropes for Java
- Ropes for JavaScript
- Ropes for Limbo
- ropes for Nim
- Ropes for OCaml
- pyropes for Python
- Ropes for Smalltalk
- SwiftRope for Swift
- "Ropey" for Rust
- Rope for Dart
- Rope & SumTree in Zed Editor
Rope (data structure)
View on Grokipediarope<T, Alloc> class that supports scalable string operations, including constant-time iterator traversal and substring handling, directly inspired by the original design.[2] They are particularly valuable in text processing environments where frequent structural changes occur without needing to rebuild entire strings.[1]
Overview
Definition and Motivation
A rope is a data structure composed of smaller strings that efficiently stores and manipulates large strings or sequences of characters, represented as a binary tree where leaf nodes hold contiguous substrings and internal nodes denote the concatenation of their children.[1] This tree structure allows ropes to form a directed acyclic graph through shared subtrees, enabling persistent representations without duplicating data.[1] The primary motivation for ropes arises from the inefficiencies of traditional immutable string implementations, which typically use contiguous arrays and require full copying of the data during operations like concatenation or substring extraction, leading to O(n time complexity where n is the string length.[1] Such approaches become impractical for applications involving frequent modifications, such as text editors handling large documents, where even simple insertions or deletions can trigger costly reallocations and copies, exacerbating performance issues as document sizes grow.[1] For instance, early editors like vi imposed strict line length limits due to these scalability constraints.[1] Ropes address these limitations by facilitating operations on very long strings—potentially gigabytes in size—without necessitating wholesale data copying, thus supporting O(log n) time for common tasks like accessing characters or splitting sequences.[1] In a text editor scenario, this allows appending entire paragraphs to a multi-megabyte document by simply linking new leaf nodes into the tree, avoiding the need to reallocate or copy the existing buffer.[1]Historical Development
The rope data structure was invented in 1995 by Hans-J. Boehm, Russ Atkinson, and Michael F. Plass at Xerox PARC.[3] It originated as part of the Cedar programming environment, a system developed at Xerox for structured document editing and processing, where efficient handling of large, mutable text was essential.[1] Although the foundational ideas trace back to earlier work in the Cedar ecosystem around 1982, the formal design and implementation were detailed in their seminal paper, emphasizing ropes as an immutable alternative to traditional contiguous string representations to support robust runtime performance in editing applications.[1] The key publication, "Ropes: An Alternative to Strings," appeared in Software: Practice and Experience in December 1995, describing the binary tree-based structure with weighted nodes for balanced operations.[3] The implementation integrates ropes with conservative garbage collection systems, as explored in Boehm's research, to manage string-like data without frequent allocations. This work highlighted ropes' suitability for systems requiring persistence and sharing of string data, influencing subsequent developments in memory management for text-heavy applications.[1] Adoption began with its inclusion in the Silicon Graphics (SGI) Standard Template Library (STL) implementation, providing a C++ rope class for efficient string manipulation. The structure gained further traction in libraries such as Boost.Rope, which extended the SGI design for broader use in C++ projects. In modern contexts, variants like the Rust crate ropey, developed starting in 2015 and released in stable form (version 1.0) in 2019, incorporate weighted trees for balancing insertions and deletions while supporting UTF-8 text editing in resource-constrained environments.[4] More recently, as of 2024, the Zed code editor employs a rope-based data structure augmented with a SumTree for efficient text buffering.[5] These extensions maintain the core binary tree organization but optimize for contemporary language features and performance needs.Internal Structure
Node Composition
In a rope data structure, nodes are the fundamental building blocks, categorized primarily into leaf nodes and internal concatenation nodes. Leaf nodes store the actual string data as contiguous arrays of characters; in the original design, these are non-empty flat strings, while implementations may use fixed-size chunks for efficiency. For instance, C++ implementations like Boost and GNU libstdc++ allocate leaf chunks in increments of 8 bytes.[6] Each leaf node includes a field for the length of the stored string. Additionally, leaf nodes can incorporate a function pointer to compute characters on demand, such as for accessing data from files or other non-contiguous sources.[1] Internal nodes, often termed concatenation nodes, serve to link sub-ropes without storing character data directly; they consist of two pointers to child nodes (left and right) and a length field for the left child (weight) to facilitate traversal. The total length is the sum of left and right subtree lengths, explicitly stored in some implementations like Boost and GNU libstdc++. These nodes also feature a depth field, usually a single byte, to track the node's position in the tree and support balancing checks. In the original C implementation, ropes assume garbage collection for memory management; C++ implementations like Boost and GNU libstdc++ use a reference count field to manage shared ownership without GC, supporting manual deallocation or atomic updates for thread safety.[1][6] Some rope variants include specialized node types for advanced features, such as substring nodes that hold a pointer to a base flat cord along with a start offset in the original design, or to a general base rope in C++ implementations; function nodes defer string computation via a callable object. In C++ implementations, a tag field (e.g., 8-bit enum) distinguishes node types, and a balanced flag indicates if the subtree is balanced. The original design uses depth and length thresholds based on Fibonacci numbers for balancing, without a flag. Optional metadata fields, like a cached flattened C-string representation, may be present in implementations to accelerate operations. These attributes ensure ropes remain efficient for large-scale text manipulation.[1][6] In terms of memory layout, particularly in C++ implementations like those in GNU libstdc++ or Boost, nodes are defined as structs inheriting from a base class that includes the size and reference count fields for cache-friendly alignment; leaf nodes embed the character array directly, while internal nodes use pointers to children, all sized with standard types likesize_t for lengths and unsigned char for depth and tags to minimize overhead. This structure allows leaf chunks to be allocated contiguously, reducing fragmentation, and supports atomic reference counting for thread safety in shared environments. The compact design of internal nodes—often fitting within a single cache line—contributes to the overall performance of rope operations.[6]
Tree Organization
A rope is structured as a binary tree in which the leaves store the fundamental string chunks, typically implemented as contiguous arrays of characters or other flat representations, while internal nodes denote the concatenation of the strings represented by their left and right subtrees. This hierarchical arrangement enables the efficient storage and manipulation of large strings by distributing the content across multiple smaller segments, with the overall tree maintained in a nearly balanced state to ensure that the height remains logarithmic in the total string length, thereby supporting efficient access patterns. In the original design, balancing uses Fibonacci numbers, where a rope of depth n must have length at least the (n+2)th Fibonacci number.[1] The path from the root to a specific leaf traverses a sequence of left and right child selections, corresponding to a prefix of the entire string up to that leaf's content. The full string represented by the rope is reconstructed through an in-order traversal of the leaves, concatenating their individual chunks from left to right in the order they appear in the tree. This traversal yields the complete sequence without requiring the tree to be fully materialized in memory at once.[1] Each internal node includes a weight field that records the total length of the string in its left subtree, providing a mechanism to navigate the tree efficiently during indexing or splitting operations by comparing the target position against these cumulative weights. This weighting supports decisions that promote balance in the tree structure, helping to keep subtrees of comparable size and preventing degeneration into skewed forms. In implementations like the Boost C++ library, the weight is the size attribute of the left child.[1][6] Ropes are designed to be persistent data structures, meaning that the trees are immutable once created, and any modification operations produce a new root node while leaving the original tree intact. This immutability facilitates sharing of subtrees across multiple ropes, as common prefixes or segments can be referenced without duplication, enabling a form of copy-on-write semantics where only the affected path from the root to the modification site is copied and updated. Such persistence is particularly valuable in scenarios involving versioned or branched string manipulations, as it minimizes memory overhead for shared components. In the original C implementation, ropes are represented as pointers toconst char* (NULL for empty, direct string, function closure, or concatenation node).[1][6]
Core Operations
Concatenation
Concatenation in ropes is a fundamental operation that efficiently combines two existing ropes into a single rope, leveraging the binary tree structure to achieve near-constant time performance without copying the underlying data. The process involves creating a new internal node, known as a concatenation node, that serves as the root with the two input ropes as its left and right children. This node also stores the total length of the combined rope and a weight field representing the length of the left subtree, enabling subsequent operations to traverse the tree efficiently.[1] To handle small ropes and prevent tree imbalance from repeated concatenations of short strings, the algorithm includes optimizations for cases where one or both inputs are short leaf nodes (typically flat strings below a certain size threshold, such as 100 characters). If both ropes are short leaves, they are directly concatenated into a single new leaf node containing the combined string, avoiding unnecessary tree depth. Similarly, if the right child of the left rope and the entire right rope are both short leaves, these are merged into a single leaf, which then replaces the right child of the left rope before forming the new concatenation node. These heuristics ensure that leaves remain reasonably sized, reducing space overhead and traversal costs while maintaining balanced trees in practice.[1] The time complexity of concatenation is O(1) in the worst case for creating the new node and updating metadata, as it requires no data copying or extensive traversal. However, if the operation triggers rebalancing—such as when the tree depth exceeds a predefined limit (e.g., the machine word size)—the amortized cost becomes O(log n), where n is the total length of the rope, due to the logarithmic height of the balanced tree.[1] The following pseudocode illustrates the core concatenation function, incorporating the basic node creation and a simplified check for small leaves:function concat(r1, r2):
if is_short_leaf(r1) and is_short_leaf(r2):
return new Leaf(concatenate_strings(r1.data, r2.data))
total_length = r1.length + r2.length
return new ConcatNode(left=r1, right=r2, length=total_length, weight=r1.length)
function concat(r1, r2):
if is_short_leaf(r1) and is_short_leaf(r2):
return new Leaf(concatenate_strings(r1.data, r2.data))
total_length = r1.length + r2.length
return new ConcatNode(left=r1, right=r2, length=total_length, weight=r1.length)
Splitting
Splitting a rope at a specified index divides it into two independent ropes: the left rope containing the prefix up to but not including the index, and the right rope containing the suffix starting from the index. This operation leverages the rope's binary tree structure to achieve O(log n) time complexity, where n is the rope's total length, by performing a traversal similar to that used in indexing.[7] The algorithm begins at the root and traverses downward, using the weight (the total character count) of the left subtree at each internal node to guide the descent. If the split index is less than or equal to the left subtree's weight, the traversal recurses into the left child; otherwise, the index is decremented by the left weight, and traversal proceeds to the right child. This process identifies the path to the node containing the split point, effectively decomposing the tree into left and right components through two split operations on the search tree.[7][1] To maintain persistence, the resulting left and right ropes share unmodified subtrees and leaves with the original rope, while only the nodes along the O(log n)-long path to the split point are copied and adjusted with new pointers. This path copying ensures that the original rope remains unchanged, and subsequent modifications to either split result do not affect the other, with shared leaves persisting until explicitly altered.[7][1] Edge cases are handled efficiently: splitting at index 0 yields an empty left rope and the original as the right rope, while splitting at the full length produces the original as the left rope and an empty right rope. Empty ropes are represented without nodes, often as null references, to prevent allocation overhead. If the split occurs within a leaf, the leaf's string is divided into two new leaves containing the respective substrings.[7][1] The operation is typically implemented via recursive descent, as shown in the following pseudocode, which returns a pair of new ropes:function split(rope, index):
if rope is null or index <= 0:
return (null, rope) // empty left, full right
if index >= [weight](/page/The_Weight)(rope):
return (rope, null) // full left, empty right
if rope is [leaf](/page/Leaf):
left_str = rope.str.[substring](/page/Substring)(0, index)
right_str = rope.str.[substring](/page/Substring)(index)
return (new_leaf(left_str), new_leaf(right_str))
else:
left_weight = [weight](/page/The_Weight)(rope.left)
if index <= left_weight:
([leftL](/page/Leaf), [leftR](/page/Rope)) = split(rope.left, index)
return ([leftL](/page/Leaf), new_node([leftR](/page/Rope), rope.right))
else:
([rightL](/page/Rope), [rightR](/page/The_Rope)) = split(rope.right, index - left_weight)
return (new_node(rope.left, [rightL](/page/Rope)), [rightR](/page/The_Rope))
function split(rope, index):
if rope is null or index <= 0:
return (null, rope) // empty left, full right
if index >= [weight](/page/The_Weight)(rope):
return (rope, null) // full left, empty right
if rope is [leaf](/page/Leaf):
left_str = rope.str.[substring](/page/Substring)(0, index)
right_str = rope.str.[substring](/page/Substring)(index)
return (new_leaf(left_str), new_leaf(right_str))
else:
left_weight = [weight](/page/The_Weight)(rope.left)
if index <= left_weight:
([leftL](/page/Leaf), [leftR](/page/Rope)) = split(rope.left, index)
return ([leftL](/page/Leaf), new_node([leftR](/page/Rope), rope.right))
else:
([rightL](/page/Rope), [rightR](/page/The_Rope)) = split(rope.right, index - left_weight)
return (new_node(rope.left, [rightL](/page/Rope)), [rightR](/page/The_Rope))
Access and Modification Operations
Indexing
In a rope data structure, indexing to access a character at a specific position involves traversing the binary tree from the root node downward, using the cumulative lengths stored in each node's weight field to decide whether to descend into the left or right subtree.[1] This process continues until a leaf node is reached, at which point the offset within the leaf's string buffer is computed to retrieve the desired character directly.[1] The algorithm assumes balanced trees, such as AVL or B-trees, to ensure logarithmic depth.[7] For retrieving substrings spanning multiple leaves, the operation uses two split operations: first splitting the rope at the starting index to separate the prefix, then splitting the remaining rope at the desired length to isolate the substring as a new sub-rope.[1] This creates a new rope representing the range without flattening or reconstructing the contents into a contiguous string.[1] The time complexity for indexing a single character is , dominated by the tree descent proportional to the height, with constant-time access within the fixed-size leaf buffer.[1] Substring retrieval to create a sub-rope is also ; flattening the sub-rope to a contiguous string would require additional time, where is the substring length.[1] The following pseudocode illustrates the core traversal for character access:function getChar(rope, index):
node = rope.root
while not isLeaf(node):
if index < node.left.weight:
node = node.left
else:
index -= node.left.weight
node = node.right
return node.data[index]
function getChar(rope, index):
node = rope.root
while not isLeaf(node):
if index < node.left.weight:
node = node.left
else:
index -= node.left.weight
node = node.right
return node.data[index]