Block sort

from Wikipedia
Block sort stably sorting numbers 1 to 16.
Insertion sort groups of 16, extract two internal buffers, tag the A blocks (of size √16 = 4 each), roll the A blocks through B, locally merge them, sort the second buffer, and redistribute the buffers.
Class: Sorting algorithm
Data structure: Array
Worst-case performance: O(n log n)
Best-case performance: O(n)
Average performance: O(n log n)
Worst-case space complexity: O(1)

Block sort, or block merge sort, is a sorting algorithm combining at least two merge operations with an insertion sort to arrive at O(n log n) (see Big O notation) in-place stable sorting time. It gets its name from the observation that merging two sorted lists, A and B, is equivalent to breaking A into evenly sized blocks, inserting each A block into B under special rules, and merging AB pairs.

One practical algorithm for O(n log n) in-place merging was proposed by Pok-Son Kim and Arne Kutzner in 2008.[1]

Overview


The outer loop of block sort is identical to a bottom-up merge sort, where each level of the sort merges pairs of subarrays, A and B, in sizes of 1, then 2, then 4, 8, 16, and so on, until both subarrays combined are the array itself.

Rather than merging A and B directly as with traditional methods, a block-based merge algorithm divides A into discrete blocks of size √A (resulting in √A number of blocks as well),[2] inserts each A block into B such that the first value of each A block is less than or equal (≤) to the B value immediately after it, then locally merges each A block with any B values between it and the next A block.

As merges still require a separate buffer large enough to hold the A block to be merged, two areas within the array are reserved for this purpose (known as internal buffers).[3] The first two A blocks are thus modified to contain the first instance of each value within A, with the original contents of those blocks shifted over if necessary. The remaining A blocks are then inserted into B and merged using one of the two buffers as swap space. This process causes the values in that buffer to be rearranged.

Once every A and B block of every A and B subarray has been merged for that level of the merge sort, the values in that buffer must be sorted with an insertion sort to restore their original order. The values in the buffers are then redistributed to their first sorted position within the array. This process repeats for each level of the outer bottom-up merge sort, at which point the array will have been stably sorted.

Algorithm


The following operators are used in the code examples:

  • |: bitwise OR
  • >>: shift right
  • %: modulo
  • ++ and +=: increment
  • [x, y): the range of values ≥ x and < y
  • |range|: range.end − range.start (the number of items in the range)
  • array[i]: the i-th item of array

Additionally, block sort relies on the following operations as part of its overall algorithm:

  • Swap: exchange the positions of two values in an array.
  • Block swap: exchange a range of values within an array with values in a different range of the array.
  • Binary search: assuming the array is sorted, check the middle value of the current search range, then if the value is lesser check the lower range, and if the value is greater check the upper range. Block sort uses two variants: one which finds the first position to insert a value in the sorted array, and one which finds the last position.
  • Linear search: find a particular value in an array by checking every single element in order, until it is found.
  • Insertion sort: for each item in the array, loop backward and find where it needs to be inserted, then insert it at that position.
  • Array rotation: move the items in an array to the left or right by some number of spaces, with values on the edges wrapping around to the other side. Rotations can be implemented as three reversals.[4]
Rotate(array, amount, range)
    Reverse(array, [range.start, range.start + amount))
    Reverse(array, [range.start + amount, range.end))
    Reverse(array, range)
  • Floor power of two: round a value down to the nearest power of two. Thus 63 becomes 32, 64 stays 64, and so forth.[5]
FloorPowerOfTwo(x)
    x = x | (x >> 1)
    x = x | (x >> 2)
    x = x | (x >> 4)
    x = x | (x >> 8)
    x = x | (x >> 16)
    if (this is a 64-bit system)
        x = x | (x >> 32)
    return x - (x >> 1)
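These two helper operations can be sketched in runnable form. The following Python version (function names are ours) rotates a range left by amount using three reversals, and implements the bit-smearing floor-power-of-two; since Python integers are arbitrary precision, the 64-bit shift is applied unconditionally:

```python
def reverse(array, start, end):
    # reverse array[start:end] in place
    array[start:end] = array[start:end][::-1]

def rotate(array, amount, start, end):
    # rotate array[start:end] left by 'amount' using three reversals
    reverse(array, start, start + amount)
    reverse(array, start + amount, end)
    reverse(array, start, end)

def floor_power_of_two(x):
    # round x down to the nearest power of two (for x >= 1)
    x |= x >> 1
    x |= x >> 2
    x |= x >> 4
    x |= x >> 8
    x |= x >> 16
    x |= x >> 32  # the 64-bit step; harmless for smaller values in Python
    return x - (x >> 1)
```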

Outer loop


As previously stated, the outer loop of a block sort is identical to a bottom-up merge sort. However, it benefits from the variant that ensures each A and B subarray is the same size to within one item:

   BlockSort(array)
       power_of_two = FloorPowerOfTwo(array.size)
       scale = array.size/power_of_two // 1.0 ≤ scale < 2.0
      
       // insertion sort 16–31 items at a time
       for (merge = 0; merge < power_of_two; merge += 16)
           start = merge * scale
           end = start + 16 * scale
           InsertionSort(array, [start, end))
      
       for (length = 16; length < power_of_two; length += length)
           for (merge = 0; merge < power_of_two; merge += length * 2)
               start = merge * scale
               mid = (merge + length) * scale
               end = (merge + length * 2) * scale
              
               if (array[end − 1] < array[start])
                   // the two ranges are in reverse order, so a rotation is enough to merge them
                   Rotate(array, mid − start, [start, end))
               else if (array[mid − 1] > array[mid])
                   Merge(array, A = [start, mid), B = [mid, end))
               // else the ranges are already correctly ordered

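To illustrate the outer loop in isolation, here is a Python sketch (names ours): the in-place block merge that the rest of this article describes is replaced by an ordinary stable merge for brevity, and arrays smaller than 16 items are insertion sorted directly, a case the pseudocode above leaves implicit:

```python
def insertion_sort(array, start, end):
    # stably sort array[start:end] in place
    for i in range(start + 1, end):
        item, j = array[i], i
        while j > start and array[j - 1] > item:
            array[j] = array[j - 1]
            j -= 1
        array[j] = item

def merge(array, start, mid, end):
    # stand-in for the in-place block merge: an ordinary stable merge
    left, right = array[start:mid], array[mid:end]
    i = j = 0
    k = start
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            array[k] = left[i]; i += 1
        else:
            array[k] = right[j]; j += 1
        k += 1
    array[k:end] = left[i:] if i < len(left) else right[j:]

def block_sort_outer(array):
    n = len(array)
    if n < 16:
        insertion_sort(array, 0, n)
        return
    power_of_two = 1 << (n.bit_length() - 1)  # FloorPowerOfTwo(n)
    scale = n / power_of_two                  # 1.0 <= scale < 2.0

    # insertion sort 16-31 items at a time
    for m in range(0, power_of_two, 16):
        insertion_sort(array, int(m * scale), int((m + 16) * scale))

    length = 16
    while length < power_of_two:
        for m in range(0, power_of_two, length * 2):
            start = int(m * scale)
            mid = int((m + length) * scale)
            end = int((m + length * 2) * scale)
            if array[end - 1] < array[start]:
                # the two ranges are in reverse order: a rotation merges them
                array[start:end] = array[mid:end] + array[start:mid]
            elif array[mid - 1] > array[mid]:
                merge(array, start, mid, end)
        length *= 2
```

Because scale has a power-of-two denominator, each product m * scale is exact in floating point for realistic array sizes, so every boundary is computed identically at every level of the sort.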
Fixed-point math may also be used, by representing the scale factor as a fraction integer_part + numerator/denominator:

   BlockSort(array)
       power_of_two = FloorPowerOfTwo(array.size)
       denominator = power_of_two/16
       numerator_step = array.size % denominator
       integer_step = floor(array.size/denominator)
   
    // insertion sort 16–31 items at a time, using the same fractional stepping as below to find each range
   
       while (integer_step < array.size)
           integer_part = numerator = 0
           while (integer_part < array.size)
               // get the ranges for A and B
               start = integer_part
           
               integer_part += integer_step
               numerator += numerator_step
               if (numerator ≥ denominator)
                   numerator −= denominator
                   integer_part++
           
               mid = integer_part
           
               integer_part += integer_step
               numerator += numerator_step
               if (numerator ≥ denominator)
                   numerator −= denominator
                   integer_part++
           
               end = integer_part
           
               if (array[end − 1] < array[start])
                   Rotate(array, mid − start, [start, end))
               else if (array[mid − 1] > array[mid])
                   Merge(array, A = [start, mid), B = [mid, end))
       
           integer_step += integer_step
           numerator_step += numerator_step
           if (numerator_step ≥ denominator)
               numerator_step −= denominator
               integer_step++
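The fractional stepping can be exercised on its own. This Python sketch (a hypothetical helper, not part of the algorithm proper) generates the (start, mid, end) ranges for one level using only integer arithmetic, and the resulting ranges tile the array exactly:

```python
def level_ranges(n, integer_step, numerator_step, denominator):
    # yield (start, mid, end) for each A/B pair at one level of the sort,
    # stepping by integer_step + numerator_step/denominator without floats
    integer_part = numerator = 0
    while integer_part < n:
        start = integer_part
        bounds = []
        for _ in range(2):  # advance once for A, once for B
            integer_part += integer_step
            numerator += numerator_step
            if numerator >= denominator:
                numerator -= denominator
                integer_part += 1
            bounds.append(integer_part)
        yield start, bounds[0], bounds[1]

# 37 items with denominator 2: each subarray steps by 18 + 1/2,
# so one A/B pair covers [0, 18) and [18, 37) with no drift
ranges = list(level_ranges(37, 18, 1, 2))
```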

Extract buffers

The buffer extraction process for block sort.

The two internal buffers needed for each level of the merge step are created by moving the first 2√A instances of their values within an A subarray to the start of A. First it iterates over the elements in A and counts off the unique values it needs, then it applies array rotations to move those unique values to the start.[6] If A did not contain enough unique values to fill the two buffers (of size √A each), B can be used just as well. In this case it moves the last instance of each value to the end of B, with that part of B not being included during the merges.

while (integer_step < array.size)
    block_size = √integer_step
    buffer_size = integer_step/block_size + 1
    [extract two buffers of size 'buffer_size' each]

If B does not contain enough unique values either, it pulls out the largest number of unique values it could find, then adjusts the size of the A and B blocks such that the number of resulting A blocks is less than or equal to the number of unique items pulled out for the buffer. Only one buffer will be used in this case – the second buffer won't exist.

buffer_size = [number of unique values found]
block_size = integer_step/buffer_size + 1
  
integer_part = numerator = 0
while (integer_part < array.size)
    [get the ranges for A and B]
    [adjust A and B to not include the ranges used by the buffers]
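A simplified sketch of the extraction idea (ours, and far less optimized than real implementations, which rotate batches of collected values rather than one value at a time): scan a sorted range for the first buffer_size unique values and move each one to the front with a one-step rotation, leaving the order of everything else intact:

```python
def extract_buffer(array, start, end, buffer_size):
    # move the first instances of up to buffer_size distinct values in the
    # sorted range array[start:end] to the front, keeping the rest stable;
    # returns the number of unique values actually found
    count = 1  # array[start] is always the first instance of its value
    i = start + 1
    while i < end and count < buffer_size:
        if array[i] != array[start + count - 1]:  # new value, not yet collected
            value = array[i]
            # rotate array[start+count : i+1] right by one step
            array[start + count + 1 : i + 1] = array[start + count : i]
            array[start + count] = value
            count += 1
        i += 1
    return count
```

The one-at-a-time version is only meant to show that extraction preserves the relative order of the remaining (duplicate) values, which is what keeps the overall sort stable.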

Tag A blocks

Tagging the A blocks using values from the first internal buffer. Note that the first A block and last B block are unevenly sized.

Once the one or two internal buffers have been created, it begins merging each A and B subarray for this level of the merge sort. To do so, it divides each A and B subarray into evenly sized blocks of the size calculated in the previous step, where the first A block and last B block are unevenly sized if needed. It then loops over each of the evenly sized A blocks and swaps the second value with a corresponding value from the first of the two internal buffers. This is known as tagging the blocks.

// blockA is the range of the remaining A blocks,
// and firstA is the unevenly sized first A block
blockA = [A.start, A.end)
firstA = [A.start, A.start + |blockA| % block_size)
  
// swap the second value of each A block with the value in buffer1
for (index = 0, indexA = firstA.end + 1; indexA < blockA.end; indexA += block_size)
    Swap(array[buffer1.start + index], array[indexA])
    index++
  
lastA = firstA
blockB = [B.start, B.start + minimum(block_size, |B|))
blockA.start += |firstA|
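The tagging step can be demonstrated concretely. In this Python sketch (names ours), the first internal buffer holds three values at the front of the array, and the second value of each evenly sized A block is swapped with the next buffer value:

```python
def tag_blocks(array, a_start, a_end, buffer_start, block_size):
    # swap the second value of each evenly sized A block with successive
    # values from the first internal buffer; returns the start of the
    # evenly sized blocks (the unevenly sized first block is skipped)
    first_a_len = (a_end - a_start) % block_size
    index = buffer_start
    pos = a_start + first_a_len + 1  # second value of the first even block
    while pos < a_end:
        array[index], array[pos] = array[pos], array[index]
        index += 1
        pos += block_size
    return a_start + first_a_len

# buffer [100, 101, 102] at indices 0-2; A = [1..6] at indices 3-8; blocks of 2
a = [100, 101, 102, 1, 2, 3, 4, 5, 6]
tag_blocks(a, 3, 9, 0, 2)
```

Because the buffer values are themselves in sorted order, comparing the tags later recovers the original order of the A blocks no matter how far they have rolled.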

Roll and drop

Two A blocks rolling through the B blocks. Once the first A block is dropped behind, the unevenly sized A block is locally merged with the B values that follow it.

After defining and tagging the A blocks in this manner, the A blocks are rolled through the B blocks by block swapping the first evenly sized A block with the next B block. This process repeats until the first value of the A block with the smallest tag value is less than or equal to the last value of the B block that was just swapped with an A block.

At that point, the minimum A block (the A block with the smallest tag value) is swapped to the start of the rolling A blocks and the tagged value is restored with its original value from the first buffer. This is known as dropping a block behind, as it will no longer be rolled along with the remaining A blocks. That A block is then inserted into the previous B block, first by using a binary search on B to find the index where the first value of A is less than or equal to the value at that index of B, and then by rotating A into B at that index.

   minA = blockA.start
   indexA = 0
  
   while (true)
       // if there's a previous B block and the first value of the minimum A block is ≤
       // the last value of the previous B block, then drop that minimum A block behind.
       // or if there are no B blocks left then keep dropping the remaining A blocks.
       if ((|lastB| > 0 and array[lastB.end - 1] ≥ array[minA]) or |blockB| = 0)
           // figure out where to split the previous B block, and rotate it at the split
           B_split = BinaryFirst(array, array[minA], lastB)
           B_remaining = lastB.end - B_split
          
           // swap the minimum A block to the beginning of the rolling A blocks
           BlockSwap(array, blockA.start, minA, block_size)
          
           // restore the second value for the A block
           Swap(array[blockA.start + 1], array[buffer1.start + indexA])
           indexA++
          
           // rotate the A block into the previous B block
           Rotate(array, blockA.start - B_split, [B_split, blockA.start + block_size))
          
           // locally merge the previous A block with the B values that follow it,
           // using the second internal buffer as swap space (if it exists)
           if (|buffer2| > 0)
               MergeInternal(array, lastA, [lastA.end, B_split), buffer2)
           else
               MergeInPlace(array, lastA, [lastA.end, B_split))
          
           // update the range for the remaining A blocks,
           // and the range remaining from the B block after it was split
           lastA = [blockA.start - B_remaining, blockA.start - B_remaining + block_size)
           lastB = [lastA.end, lastA.end + B_remaining)
          
           // if there are no more A blocks remaining, this step is finished
           blockA.start = blockA.start + block_size
           if (|blockA| = 0)
               break
          
           minA = [new minimum A block] (see below)
       else if (|blockB| < block_size)
           // move the last B block, which is unevenly sized,
           // to before the remaining A blocks, by using a rotation
           Rotate(array, blockB.start - blockA.start, [blockA.start, blockB.end))
          
           lastB = [blockA.start, blockA.start + |blockB|)
           blockA.start += |blockB|
           blockA.end += |blockB|
           minA += |blockB|
           blockB.end = blockB.start
       else
           // roll the leftmost A block to the end by swapping it with the next B block
           BlockSwap(array, blockA.start, blockB.start, block_size)
           lastB = [blockA.start, blockA.start + block_size)
           if (minA = blockA.start)
               minA = blockA.end
          
           blockA.start += block_size
           blockA.end += block_size
           blockB.start += block_size
          
           // this is equivalent to minimum(blockB.end + block_size, B.end),
           // but that has the potential to overflow
           if (blockB.end > B.end - block_size)
               blockB.end = B.end
           else
               blockB.end += block_size
  
   // merge the last A block with the remaining B values
   if (|buffer2| > 0)
       MergeInternal(array, lastA, [lastA.end, B.end), buffer2)
   else
       MergeInPlace(array, lastA, [lastA.end, B.end))

One optimization that can be applied during this step is the floating-hole technique.[7] When the minimum A block is dropped behind and needs to be rotated into the previous B block, after which its contents are swapped into the second internal buffer for the local merges, it would be faster to swap the A block to the buffer beforehand, and to take advantage of the fact that the contents of that buffer do not need to retain any order. So rather than rotating the second buffer (which used to be the A block before the block swap) into the previous B block at position index, the values in the B block after index can simply be block swapped with the last items of the buffer.

The floating hole in this case refers to the contents of the second internal buffer floating around the array, and acting as a hole in the sense that the items do not need to retain their order.

Local merges


Once the A block has been rotated into the B block, the previous A block is then merged with the B values that follow it, using the second buffer as swap space. When the first A block is dropped behind, the "previous A block" is the unevenly sized A block at the start; when the second A block is dropped behind, it is the first A block, and so forth.

MergeInternal(array, A, B, buffer)
    // block swap the values in A with those in 'buffer'
    BlockSwap(array, A.start, buffer.start, |A|)

    A_count = 0, B_count = 0, insert = 0
    while (A_count < |A| and B_count < |B|)
        if (array[buffer.start + A_count] ≤ array[B.start + B_count])
            Swap(array[A.start + insert], array[buffer.start + A_count])
            A_count++
        else
            Swap(array[A.start + insert], array[B.start + B_count])
            B_count++
        insert++

    // block swap the remaining part of the buffer with the remaining part of the array
    BlockSwap(array, buffer.start + A_count, A.start + insert, |A| - A_count)
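A direct Python transcription of MergeInternal (a sketch; A and B are assumed adjacent, and the buffer must hold at least |A| items and be disjoint from both):

```python
def block_swap(array, a, b, count):
    # exchange array[a:a+count] with array[b:b+count]
    for i in range(count):
        array[a + i], array[b + i] = array[b + i], array[a + i]

def merge_internal(array, a_start, a_end, b_start, b_end, buf_start):
    # merge sorted adjacent runs A = array[a_start:a_end] and
    # B = array[b_start:b_end] using an internal buffer as swap space;
    # the buffer's contents end up rearranged but not lost
    a_len = a_end - a_start
    block_swap(array, a_start, buf_start, a_len)  # park A in the buffer
    a_count = b_count = insert = 0
    while a_count < a_len and b_count < b_end - b_start:
        if array[buf_start + a_count] <= array[b_start + b_count]:
            array[a_start + insert], array[buf_start + a_count] = \
                array[buf_start + a_count], array[a_start + insert]
            a_count += 1
        else:
            array[a_start + insert], array[b_start + b_count] = \
                array[b_start + b_count], array[a_start + insert]
            b_count += 1
        insert += 1
    # swap whatever remains of A back out of the buffer
    block_swap(array, buf_start + a_count, a_start + insert, a_len - a_count)
```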

If the second buffer does not exist, a strictly in-place merge operation must be performed, such as a rotation-based version of the Hwang and Lin algorithm,[7][8] the Dudzinski and Dydek algorithm,[9] or a repeated binary search and rotate.

MergeInPlace(array, A, B)
    while (|A| > 0 and |B| > 0)
        // find the first place in B where the first item in A needs to be inserted
        mid = BinaryFirst(array, array[A.start], B)

        // rotate A into place
        amount = mid - A.end
        Rotate(array, |A|, [A.start, mid))

        // calculate the new A and B ranges
        B = [mid, B.end)
        A = [A.start + amount, mid)
        A.start = BinaryLast(array, array[A.start], A)
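The same routine in runnable Python (a sketch: the standard library's bisect stands in for BinaryFirst and BinaryLast, and the rotation is written with slicing for clarity, where a truly in-place implementation would use the three-reversal Rotate):

```python
from bisect import bisect_left, bisect_right

def merge_in_place(array, a_start, a_end, b_end):
    # merge adjacent sorted runs array[a_start:a_end] and array[a_end:b_end]
    # by repeated binary search and rotation
    while a_start < a_end and a_end < b_end:
        # first position in B where the head of A must be inserted
        mid = bisect_left(array, array[a_start], a_end, b_end)
        shift = mid - a_end  # number of B values that belong before A
        # rotate A past those B values (slicing here; reversals in place)
        array[a_start:mid] = array[a_end:mid] + array[a_start:a_end]
        a_start += shift
        a_end = mid
        # skip leading A values equal to the head: they are already placed,
        # and placing them before equal B values preserves stability
        a_start = bisect_right(array, array[a_start], a_start, a_end)
```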

After dropping the minimum A block and merging the previous A block with the B values that follow it, the new minimum A block must be found within the blocks that are still being rolled through the array. This is handled by running a linear search through those A blocks and comparing the tag values to find the smallest one.

minA = blockA.start
for (findA = minA + block_size; findA < blockA.end - 1; findA += block_size)
    if (array[findA + 1] < array[minA + 1])
        minA = findA

These remaining A blocks then continue rolling through the array and being dropped and inserted where they belong. This process repeats until all of the A blocks have been dropped and rotated into the previous B block.

Once the last remaining A block has been dropped behind and inserted into B where it belongs, it should be merged with the remaining B values that follow it. This completes the merge process for that particular pair of A and B subarrays. However, it must then repeat the process for the remaining A and B subarrays for the current level of the merge sort.

Note that the internal buffers can be reused for every set of A and B subarrays for this level of the merge sort, and do not need to be re-extracted or modified in any way.

Redistribute


After all of the A and B subarrays have been merged, the one or two internal buffers are still left over. The first internal buffer was used for tagging the A blocks, and its contents are still in the same order as before, but the second internal buffer may have had its contents rearranged when it was used as swap space for the merges. This means the contents of the second buffer will need to be sorted using a different algorithm, such as insertion sort. The two buffers must then be redistributed back into the array using the opposite process that was used to create them.

After repeating these steps for every level of the bottom-up merge sort, the block sort is completed.

Variants


Block sort works by extracting two internal buffers, breaking A and B subarrays into evenly sized blocks, rolling and dropping the A blocks into B (using the first buffer to track the order of the A blocks), locally merging using the second buffer as swap space, sorting the second buffer, and redistributing both buffers. While the steps do not change, these subsystems can vary in their actual implementation.

One variant of block sort allows it to use any amount of additional memory provided to it, by using this external buffer for merging an A subarray or A block with B whenever A fits into it. In this situation it would be identical to a merge sort.

Good choices for the buffer size include:

  • (count + 1)/2: turns into a full-speed merge sort, since all of the A subarrays will fit into it
  • √((count + 1)/2) + 1: this will be the size of the A blocks at the largest level of merges, so block sort can skip using internal or in-place merges for anything
  • 512: a fixed-size buffer large enough to handle the numerous merges at the smaller levels of the merge sort
  • 0: if the system cannot allocate any extra memory, no memory works well

Rather than tagging the A blocks using the contents of one of the internal buffers, an indirect movement-imitation buffer can be used instead.[1][10] This is an internal buffer defined as s1 t s2, where s1 and s2 are each as large as the number of A and B blocks, and t contains any values immediately following s1 that are equal to the last value of s1 (thus ensuring that no value in s2 appears in s1). A second internal buffer containing √A unique values is still used. The first √A values of s1 and s2 are then swapped with each other to encode information into the buffer about which blocks are A blocks and which are B blocks. When an A block at index i is swapped with a B block at index j (where the first evenly sized A block is initially at index 0), s1[i] and s1[j] are swapped with s2[i] and s2[j], respectively. This imitates the movements of the A blocks through B. The unique values in the second buffer are used to determine the original order of the A blocks as they are rolled through the B blocks. Once all of the A blocks have been dropped, the movement-imitation buffer is used to decode whether a given block in the array is an A block or a B block, each A block is rotated into B, and the second internal buffer is used as swap space for the local merges.

The second value of each A block doesn't necessarily need to be tagged – the first, last, or any other element could be used instead. However, if the first value is tagged, the values will need to be read from the first internal buffer (where they were swapped) when deciding where to drop the minimum A block.

Many sorting algorithms can be used to sort the contents of the second internal buffer, including unstable sorts like quicksort, since the contents of the buffer are guaranteed to be unique. Insertion sort is still recommended, though, for its situational performance and lack of recursion.

Implementations


Known implementations of Block sort include:

  • Kutzner and Kim's implementation in C++.[11]
  • Wikisort, Mike McFadden's implementation based on Kutzner and Kim's paper.[12]
  • GrailSort (and the rewritten HolyGrailSort), Andrey Astrelin's implementation based on Huang and Langston (1992),[13] which ultimately describes a very similar algorithm.[14]

Analysis


Block sort is a well-defined and testable class of algorithms, with working implementations available as a merge and as a sort.[11][15][12] This allows its characteristics to be measured and considered.

Complexity


Block sort begins by performing insertion sort on groups of 16–31 items in the array. Insertion sort is an O(n²) operation, so this leads to anywhere from O(16² × n/16) to O(31² × n/31), which is O(n) once the constant factors are omitted. It must also apply an insertion sort on the second internal buffer after each level of merging is completed. However, as this buffer was limited to √n in size, the operation also ends up being O(n).

Next it must extract two internal buffers for each level of the merge sort. It does so by iterating over the items in the A and B subarrays and incrementing a counter whenever the value changes, and upon finding enough values it rotates them to the start of A or the end of B. In the worst case this will end up searching the entire array before finding √A non-contiguous unique values, which requires O(n) comparisons and √A rotations for √A values. This resolves to O(n + √A × √A), or O(n).

When none of the A or B subarrays contained enough unique values to create the internal buffers, a normally suboptimal in-place merge operation is performed where it repeatedly binary searches and rotates A into B. However, the known lack of unique values within any of the subarrays places a hard limit on the number of binary searches and rotations that will be performed during this step, which is again √A items rotated up to √A times, or O(n). The size of each block is also adjusted to be smaller in the case where it found √A unique values but not 2√A, which further limits the number of unique values contained within any A or B block.

Tagging the A blocks is performed √A times for each A subarray, then the A blocks are rolled through and inserted into the B blocks up to √A times. The local merges retain the same O(n) complexity of a standard merge, albeit with more assignments since the values must be swapped rather than copied. The linear search for finding the new minimum A block iterates over √A blocks up to √A times. And the buffer redistribution process is identical to the buffer extraction but in reverse, and therefore has the same O(n) complexity.

After omitting all but the highest complexity and considering that there are log(n) levels in the outer merge loop, this leads to a final asymptotic complexity of O(n log(n)) for the worst and average cases. For the best case, where the data is already in order, the merge step performs n/16 comparisons for the first level, then n/32, n/64, n/128, etc. This is a well-known mathematical series which resolves to O(n).

Memory


As block sort is non-recursive and does not require the use of dynamic allocations, this leads to constant stack and heap space. It uses O(1) auxiliary memory in a transdichotomous model, which accepts that the O(log n) bits needed to keep track of the ranges for A and B cannot be any greater than 32 or 64 on 32-bit or 64-bit computing systems, respectively, and therefore simplifies to O(1) space for any array that can feasibly be allocated.

Stability


Although items in the array are moved out of order during a block sort, each operation is fully reversible and will have restored the original order of equivalent items by its completion.

Stability requires the first instance of each value in an array before sorting to still be the first instance of that value after sorting. Block sort moves these first instances to the start of the array to create the two internal buffers, but when all of the merges are completed for the current level of the block sort, those values are distributed back to the first sorted position within the array. This maintains stability.

Before rolling the A blocks through the B blocks, each A block has its second value swapped with a value from the first buffer. At that point the A blocks are moved out of order to roll through the B blocks. However, once it finds where it should insert the smallest A block into the previous B block, that smallest A block is moved back to the start of the A blocks and its second value is restored. By the time all of the A blocks have been inserted, the A blocks will be in order again and the first buffer will contain its original values in the original order.

Using the second buffer as swap space when merging an A block with some B values causes the contents of that buffer to be rearranged. However, as the algorithm already ensured the buffer only contains unique values, sorting the contents of the buffer is sufficient to restore their original stable order.

Adaptivity


Block sort is an adaptive sort on two levels: first, it skips merging A and B subarrays that are already in order. Next, when A and B need to be merged and are broken into evenly sized blocks, the A blocks are only rolled through B as far as is necessary, and each block is only merged with the B values immediately following it. The more ordered the data originally was, the fewer B values there will be that need to be merged into A.

Advantages


Block sort is a stable sort that does not require additional memory, which is useful in cases where there is not enough free memory to allocate the O(n) buffer. When using the external buffer variant of block sort, it can scale from using O(n) memory to progressively smaller buffers as needed, and will still work efficiently within those constraints.

Disadvantages


Block sort does not exploit sorted ranges of data on as fine a level as some other algorithms, such as Timsort.[16] It only checks for these sorted ranges at the two predefined levels: the A and B subarrays, and the A and B blocks. It is also harder to implement and parallelize compared to a merge sort.

References

[edit]
Revisions and contributorsEdit on WikipediaRead on Wikipedia
from Grokipedia
Block sort, also known as block merge sort, is a stable, in-place sorting algorithm that achieves a worst-case time complexity of O(n log n) while requiring only O(1) extra space beyond the input array.[1] It operates by dividing the input sequence into small blocks of size approximately √n, sorting each block individually using a simple method like insertion sort in linear time relative to the block size, and then iteratively merging these sorted blocks through a series of block interchanges and rotations that preserve stability without allocating additional memory proportional to the input.[1] The core innovation of block sort lies in its merging procedure, which treats the two sorted halves to be merged as consisting of O(√n) blocks each of size O(√n), enabling a linear-time merge by performing block sorts on subsets and strategic rotations to resolve overlaps.[1] This approach ensures stability by tracking the origin of elements during block rearrangements and performs a linear number of key comparisons and exchanges for each merge, with the full sort requiring O(n log n) operations overall.[1] Originally proposed by Bing-Chao Huang and Michael A. Langston in 1992, the algorithm was designed for applications where memory is at a premium, such as external sorting or embedded systems, and it outperforms traditional merge sort in space-constrained scenarios despite higher constant factors in time.[1] Subsequent implementations and refinements, such as WikiSort and GrailSort, have built upon this foundation to enhance practical performance on modern hardware by optimizing block sizes for cache locality and incorporating ratio-based merging techniques.[2][3] These variants maintain the original algorithm's theoretical guarantees while achieving competitive speeds in various software projects.

Introduction

Definition and Overview

Block sort, also known as block merge sort, is a hybrid sorting algorithm that integrates insertion sort for sorting small blocks with subsequent merge operations to achieve stable, in-place sorting of arrays in O(n log n) time complexity. It operates by dividing the input array into fixed-size blocks, typically of a small constant size such as 16 or 32 elements,[2] and initially sorting each block individually using an efficient local sorting method like insertion sort. These sorted blocks are then progressively merged in a bottom-up manner to produce the fully sorted array, ensuring stability by preserving the relative order of equal elements throughout the process.[4] The primary purpose of block sort is to provide an efficient sorting solution in memory-constrained environments, where traditional merge sort's requirement for O(n) auxiliary space is prohibitive. Modern iterative implementations avoid recursion and external memory allocation, achieving practical performance comparable to standard merge sort while using only O(1) extra space beyond the input array. This makes it suitable for systems with limited RAM, such as embedded devices or large-scale data processing where in-place operations are essential.[2] A key innovation in block sort is the extraction of temporary buffers directly from the array itself, utilizing dedicated block regions as swap space during merges. This internal buffer mechanism facilitates stable merging without overwriting unsorted data, enabling the algorithm to perform all operations within the original array footprint. The approach builds on foundational in-place merging techniques but optimizes them for block-based partitioning to minimize element movements and maintain worst-case efficiency.[4] Block sort relates to traditional merge sort through its divide-and-merge paradigm but adapts it specifically for in-place execution using block-level granularity.[5]

Historical Context and Motivation

The concept of block merge sort originated with the 1992 paper by B.-C. Huang and M. A. Langston, which proposed fast stable merging and sorting in constant extra space using a block-based approach.[1] Block sort then advanced during the late 2000s, driven by efforts to achieve efficient, stable, in-place merging within merge sort frameworks. Traditional merge sort, while asymptotically optimal at O(n log n), requires O(n) extra space for merging, and recursive implementations risk stack overflow on large inputs. Researchers addressed these issues with in-place stable merging techniques, with foundational contributions from Pok-Son Kim and Arne Kutzner. Their 2006 paper introduced an optimal in-place merging algorithm requiring O(m log(n/m + 1)) comparisons and O(m + n) assignments for merging sequences of sizes m and n (m ≤ n), building on earlier work such as Symvonis's stable merging from 1993. This laid the groundwork for block-based approaches by enabling merging without buffers beyond a small constant amount of space.[6]

The 2008 paper by Kim and Kutzner refined this with a ratio-based stable in-place merging algorithm whose performance depends on the input size ratio k = n/m. Presented at the Third International Conference on Theory and Applications of Models of Computation (TAMC), it proposed a modular "Stable-Optimal-Block-Merge" procedure that is asymptotically optimal in both comparisons (Ω(m log(k + 1))) and assignments (m + n), outperforming prior algorithms in benchmarks, particularly for the symmetric inputs common in merge sort. Block sort evolved directly from these merging innovations, integrating them into a bottom-up merge sort structure (iterative, to avoid recursion) and incorporating insertion sort for the initial small-block sorting.

This hybrid design addressed the limitations of unstable algorithms like heapsort, which do not preserve the relative order of equal elements, and of non-adaptive sorts that fail to exploit partial order. Unlike TimSort, introduced in 2002 for Python with its emphasis on natural run detection for adaptability, block sort favors fixed block sizes (e.g., 16-32 elements) for predictable worst-case performance without the overhead of run scanning.[5][4]

The motivation for block sort stemmed from practical needs in resource-constrained environments, such as embedded systems, and from standard library implementations in languages like C++, where std::stable_sort often allocates temporary arrays, increasing the memory footprint. With O(1) extra space and guaranteed O(n log n) time, block sort provides a stable alternative to quicksort variants, which degrade to O(n²) in the worst case, while retaining merge sort's stability without the risks of recursion. Early descriptions appeared in academic proceedings such as TAMC 2008, emphasizing its relevance to full sorting via repeated block merges. The first notable open-source implementation, WikiSort by Mike McFadden in 2014, popularized block sort by providing a complete, efficient realization in C++ based on the Kim-Kutzner merging, influencing subsequent variants and integrations in performance-critical software.[5][7][2]

Algorithm

Outer Loop and Merge Phases

The block sort algorithm adopts a non-recursive, bottom-up merging strategy, iterating through phases in which the merge width doubles (e.g., 16, 32, 64, ..., up to the array length n) to build larger sorted runs from smaller ones. This iterative framework uses integer arithmetic (such as fixed-point scale factors) to compute block boundaries precisely, avoiding the precision issues of floating-point calculations and ensuring deterministic behavior across implementations.[4] Within each phase, the algorithm walks the array over alternating subarrays designated A and B, each a sorted run produced by the previous phase, merging them pairwise into sorted runs of twice the size while preserving stability.

Before the outer loop begins, the input array is partitioned into small initial blocks, typically of 16 to 32 elements, and each block is sorted individually with insertion sort to create the initial runs; this hybrid preprocessing phase runs in linear time overall and establishes the foundation for the subsequent merges. The block size used for merging within a phase is chosen as approximately √r, where r is the current run length, to balance the efficiency of the later in-place merges.[4]

Each iteration of the outer loop first invokes buffer extraction to allocate a temporary workspace from the array itself, then calls the core merging routines that rearrange and combine the A and B blocks. Odd-length arrays are handled through boundary adjustments in the phase setup: the final partial block is treated specially so that all elements are incorporated without breaking the doubling progression.[4]
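The outer loop described above can be sketched as follows. This is an illustrative model, not the in-place algorithm itself: the function name is hypothetical, and the internal buffer-based block merge is replaced by an ordinary stable buffered merge for clarity.

```python
def block_sort_outline(a, run=16):
    """Illustrative outer loop of block sort (hypothetical sketch).

    Initial runs of length `run` are insertion-sorted, then merged
    bottom-up with doubling widths. For clarity, the in-place block
    merge is replaced here by a simple stable buffered merge.
    """
    n = len(a)
    # Phase 0: insertion-sort each initial block of `run` elements.
    for start in range(0, n, run):
        end = min(start + run, n)
        for i in range(start + 1, end):
            v, j = a[i], i
            while j > start and a[j - 1] > v:
                a[j] = a[j - 1]
                j -= 1
            a[j] = v
    # Outer loop: merge widths double each phase (run, 2*run, 4*run, ...).
    width = run
    while width < n:
        for start in range(0, n, 2 * width):
            mid = min(start + width, n)
            end = min(start + 2 * width, n)
            left, right = a[start:mid], a[mid:end]
            out, i, j = [], 0, 0
            while i < len(left) and j < len(right):
                if left[i] <= right[j]:      # <= keeps equal keys stable
                    out.append(left[i]); i += 1
                else:
                    out.append(right[j]); j += 1
            a[start:end] = out + left[i:] + right[j:]
        width *= 2
    return a
```

Real implementations substitute the buffer-based in-place merge described in the following subsections for the list slicing used here; the phase structure is the same.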

Buffer Extraction

Buffer extraction is the initial step that reserves temporary storage areas directly from the input array, enabling in-place operation with no more than a constant amount of additional memory. The algorithm scans the array to identify approximately √n unique elements, which are then relocated to the array's extremities using rotations or reversals. This creates two dedicated buffers: buffer A at the front of the array and buffer B at the rear, each roughly of size √n. By leveraging these self-contained buffers, the algorithm avoids external allocation while preparing for the merging phases.[8]

In cases where the array lacks sufficient unique elements, such as datasets with heavy duplication, a linear search gathers the maximum available distinct values, followed by targeted rotations to consolidate them into the buffer positions. If even this yields inadequate space, the buffer sizes are reduced to match the available uniques, so the algorithm remains functional across diverse input distributions.[9]

The extracted buffers fulfill complementary roles: one primarily stores tagged values derived from the unique elements to track block identities and preserve stability during element swaps, while the other serves as general temporary storage for relocating elements during merges. After extraction, both buffers are individually sorted using insertion sort, which is fast at these small sizes, establishing them as ordered workspaces for the broader algorithm.[8]

Consider an array of size n = 100, where buffers of size 10 are targeted by first scanning for 20 distinct elements overall. The process iteratively identifies uniques via linear search, then applies rotations to shift segments, such as rotating a prefix to position the first 10 uniques at the front and a suffix for the rear 10, isolating the buffers while preserving the relative order of the remaining data. Insertion sort is then applied to each 10-element buffer, yielding sorted regions at indices 0-9 and 90-99, ready for tagging and merging use.[9]
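The gathering of distinct values can be sketched as follows, under the simplifying assumption that the scanned region is already a sorted run (as it is after the initial insertion sort); the function name is hypothetical:

```python
def extract_front_buffer(a, k):
    """Gather up to k distinct values at the front of the sorted run `a`
    by rotating each newly found unique back to the buffer prefix.
    Returns the buffer length actually obtained (may be < k when the
    data lacks enough distinct values)."""
    count = 0
    for i in range(len(a)):
        if count == k:
            break
        if count == 0 or a[i] != a[count - 1]:
            # Rotate a[count:i+1] one step so the unique lands at `count`;
            # the skipped duplicates shift right and stay in order.
            a[count:i + 1] = [a[i]] + a[count:i]
            count += 1
    return count
```

On [1, 1, 2, 2, 3] with k = 3 this yields the buffer [1, 2, 3] at the front and leaves the sorted remainder [1, 2] behind it; when fewer than k uniques exist, the returned count signals the reduced buffer size described above.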

Block Tagging

Block tagging marks blocks from sequence A (w-blocks) using a dedicated block distribution storage (bd-storage) managed in place via the extracted buffers, which provide space proportional to the number of blocks without additional memory allocation, so blocks can be tracked during subsequent merge operations. Tagging is achieved by swapping specific elements between the main array and the buffer areas: for each w-block, corresponding pairs in the bd-storage are exchanged with elements in the array, while non-w-blocks remain untouched, encoding each block's origin without auxiliary data structures. This method leverages the buffers as temporary storage to preserve the relative order of blocks.[4]

The primary purpose of block tagging is to let the rearrangement and local merging phases identify where each A block originated, avoiding explicit pointers or extra space while ensuring sort stability by maintaining the relative positions of equal elements across blocks. By embedding block identity directly in the array via these swaps, the algorithm supports efficient roll-and-drop operations that approximate the final merged order.[4]

Implementation involves iterating through the sorted blocks in the outer loop and performing the swaps in the buffer-based bd-storage to tag w-blocks, as described in the stable-optimal-block-merge procedure. The bd-storage functions as a key-value store in which the position of a value identifies the corresponding block, allowing straightforward recovery of block positions after rearrangement. The buffers extracted earlier provide the necessary temporary space for these operations.[4]

Edge cases, such as undersized blocks at the end of a sequence (fewer than the block size δ = ⌊√m⌋ elements), are excluded from tagging and rearrangement to prevent errors, with the algorithm adjusting δ based on available buffer size so that all tags remain unique and recoverable. Special provisions maintain stability for these partial blocks during the final local merges.[4]
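The key-value role of the bd-storage can be illustrated with a toy example (all values hypothetical): distinct, sorted buffer values are swapped into the block heads as tags, and after rolling, a block's original index is recovered as the rank of the tag it carries.

```python
def tagging_demo():
    """Toy illustration of tag-based block identification."""
    tags = [101, 102, 103, 104]    # sorted, distinct values from the buffer
    heads = [9, 7, 5, 3]           # head element of A blocks 0..3
    stash = []                     # bd-storage: holds the displaced heads
    for i in range(len(heads)):    # swap tag in, real head out
        stash.append(heads[i])
        heads[i] = tags[i]
    # Rolling permutes the blocks; here they end up in order 2, 0, 3, 1.
    rolled = [heads[2], heads[0], heads[3], heads[1]]
    # Recovery: the rank of each tag is the block's original index.
    order = [sorted(tags).index(t) for t in rolled]
    # Untagging: restore each block's real head from the stash.
    restored = [stash[i] for i in order]
    return order, restored
```

Because the tags are drawn from the sorted buffer, no extra memory is needed to record the permutation: the tags themselves are the record.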

Roll, Drop, and Local Merging

In the roll phase, the tagged A blocks, identified in the prior tagging step, are rotated rightward through adjacent B blocks to position them for merging. This simulates the movement of A blocks without copying large segments, using a temporary buffer to hold the displaced B elements during the rotation. Block rotations and swaps preserve the relative order of elements within each block while advancing the A blocks toward their insertion points in the B sequence. The number of assignments in this phase is bounded by O(m + n), where m and n are the sizes of the respective block sequences.[4]

The drop phase follows: for each tagged A block, binary search locates the insertion point in the preceding B block, and the minimal prefix of the A block that belongs at that position is "dropped" by rotating it into the B block, integrating the smallest qualifying segment while leaving the remainder of the A block for subsequent drops. The process repeats until the A block is fully emptied and merged, minimizing unnecessary movement while keeping dropped segments sorted relative to the B elements. The binary search limits comparisons to O(log m) per drop, contributing to the overall efficiency of the merging step.[4]

Local merging then resolves the unevenly sized segments left by the drops, combining dropped A prefixes with the surrounding B elements. This uses in-place algorithms such as the Hwang-Lin method, which merges a small sequence of size n into a larger one of size m in O(n log m) time, often with buffer swaps to hold elements temporarily during the merge. For small local segments this avoids full array scans, relying instead on targeted rotations and comparisons to interleave the elements correctly. For reverse-sorted blocks, the binary search in drops quickly identifies large prefix matches, reducing the number of local merge iterations needed.[4]

Stability is preserved throughout these phases by relying on the tags to distinguish whether elements originated in A or B blocks; when elements compare equal, the merge places those from the earlier block first, maintaining their original relative order. The stable local merges, such as the Hwang-Lin variant employed, ensure that no inversions occur within groups of equal elements, while the positions tracked in the buffer and block storage prevent disruption from rotations. This tag-based comparison allows equal elements to be handled without extra space, confirming stability for arbitrary inputs, including reverse-sorted blocks.[4]
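The drop primitive, a binary search followed by a rotation, can be sketched as a stable in-place merge of two adjacent sorted runs; this is a simplified stand-in for the tagged-block version, with a hypothetical signature:

```python
from bisect import bisect_left

def merge_by_rotation(arr, lo, mid, hi):
    """Stably merge sorted runs arr[lo:mid] (A) and arr[mid:hi] (B) in place.

    Each step binary-searches how many B elements must precede the current
    A head, then rotates them past the A run. On equal keys, bisect_left
    keeps A elements first, preserving stability."""
    while lo < mid and mid < hi:
        if arr[lo] <= arr[mid]:
            lo += 1                                  # A head already placed
        else:
            j = bisect_left(arr, arr[lo], mid, hi)   # B elements < arr[lo]
            arr[lo:j] = arr[mid:j] + arr[lo:mid]     # rotate them leftward
            lo += j - mid
            mid = j
    return arr
```

The rotation is written with list slicing for readability; an in-place implementation would perform it with three reversals or a juggling rotation over the same index range.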

Redistribution

In the redistribution phase, the second buffer (buffer B), which holds elements displaced during merge preparation, is first sorted using insertion sort to restore ascending order. The elements of buffer A and the now-sorted buffer B are then reintegrated into the main array through a linear scan, with each buffer element inserted at the position found by binary search and rotations used to shift existing elements aside without disrupting stability.[4]

Untagging occurs concurrently during reintegration: the original untagged values are restored by swapping them back from buffer A, which held tagged copies for stability tracking during the roll, drop, and local merging steps, yielding a clean, fully merged sorted run with no markers remaining.[10] To finalize the phase, a collective rotation or shift collapses the temporary buffer spaces, making the entire subarray contiguous and sorted in preparation for the next outer loop phase, so no fragmented regions remain.[10]

Redistribution runs in O(n) time per merge phase, dominated by the linear scans and the rotations performed per insertion; any residual unsorted tail at the array's end is resolved by direct insertion into the growing sorted run.[4]
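A minimal sketch of the reinsertion step, assuming the buffer occupies the front of the subarray and the merged run follows it (the name and layout are illustrative):

```python
def redistribute(arr, buf_len):
    """Slide each buffer element from arr[:buf_len] rightward into the
    sorted run arr[buf_len:], largest buffer element first so earlier
    moves do not disturb later ones. The strict comparison keeps buffer
    elements ahead of equal run elements, preserving stability."""
    for i in range(buf_len - 1, -1, -1):
        v, j = arr[i], i
        while j + 1 < len(arr) and arr[j + 1] < v:
            arr[j] = arr[j + 1]
            j += 1
        arr[j] = v
    return arr
```

Each element moves at most once past every smaller run element, so a whole redistribution pass stays linear in the subarray length, matching the O(n) bound stated above.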

Variants

Buffer Size and Tagging Alternatives

In block sort algorithms, the buffer size plays a critical role in balancing space efficiency and merge performance. The standard approach allocates a buffer of size approximately √n, where n is the number of elements, to facilitate efficient block extraction and merging while maintaining near-optimal time complexity.[4] This size allows tagging and rolling operations on blocks of comparable dimensions, letting the algorithm achieve O(n log n) worst-case performance with minimal extra space.

Practical variants adjust the buffer size for specific implementations. WikiSort extracts internal buffers of fixed small size (for example, based on initial block groups of 16 elements, yielding tags for blocks of size 4) to optimize for cache locality while preserving O(1) extra space overall. GrailSort likewise uses a small buffer for local merges and relies on block-swapping tags to ensure stability without additional space proportional to n.[11]

Tagging is essential for preserving stability during block rearrangement. Implementations like GrailSort and WikiSort mark elements from A blocks using values from an internal buffer to track origins without physically swapping entire blocks, reducing element moves while maintaining the algorithm's guarantees. Further optimizations include adjustable initial block sizes for the insertion sort phase (typically 16, but extendable) to minimize merge levels, which is particularly beneficial on hardware with varying cache sizes. These modifications trade space against time: smaller or fixed buffers improve cache behavior on modern processors at the potential cost of more rotations during merging.

External Memory and Hybrid Approaches

Block-based sorting techniques have been adapted to external memory settings to handle datasets larger than available RAM, particularly in information retrieval (IR) systems for constructing inverted indices from large document collections. While not direct extensions of the in-place block sort, methods like blocked sort-based indexing (BSBI) similarly partition data into blocks that fit in memory, sort each internally, and merge using disk as temporary storage.[12] This relaxes the O(1) space constraint to O(n) external space, enabling scalability to large corpora such as the 1 GB Reuters-RCV1 collection of 800,000 documents. In BSBI, documents are parsed into termID-documentID pairs, accumulated into blocks (e.g., 10 million pairs per block), sorted in memory, inverted to partial postings lists, written to disk, and merged via multi-way merge with priority queues.

Hybrid approaches can add adaptivity or parallelism to block sort variants. Run detection, as in TimSort, identifies existing sorted runs to reduce merging overhead, improving average-case performance while keeping the O(n log n) worst case and O(√n) auxiliary space for tagging. In multi-core settings, parallel variants such as those built on WikiSort implementations parallelize the local block sorts and merges, using barriers for synchronization to exploit hardware concurrency without excessive overhead. Other variants sort indirectly via index arrays, permuting references rather than moving elements, which suits non-movable objects in databases. These methods are applied in IR and database systems for efficient indexing under memory constraints.[12]

Implementations

Key Software Implementations

WikiSort represents a foundational open-source implementation of block sort, specifically a stable, in-place block merge sort. Released in 2014 by developer Mike McFadden on GitHub, it operates with O(1) extra memory and is available in C, C++, and Java versions.[2] The algorithm draws on the ratio-based stable in-place merging technique of Kim and Kutzner, enabling efficient merging without significant temporary storage.[4] Its design emphasizes speed, making it applicable in performance-critical scenarios such as data processing tools.

GrailSort extends block sort principles with optimizations targeted at contemporary hardware. This C-compatible implementation by Andrey Astrelin guarantees worst-case O(n log n) time while preserving stability and in-place operation.[10] It builds on the foundational work of Huang and Langston from 1989–1992 and includes variants such as buffer-assisted versions for further acceleration.[10] Benchmarks within the repository highlight its comparative advantages, such as up to 23% faster performance than standard stable sorts on large datasets with structured keys.[10]

Block sort has also been integrated into specialized C++ libraries, notably cpp-sort, which adapts WikiSort as its wiki_sorter for generic sorting needs.[13] The original WikiSort repository provides a Java port as well, facilitating use in JVM-based environments.[2] These efforts underscore the algorithm's portability across languages and frameworks.

Common traits across these implementations include non-recursive structure to avoid stack limitations, block sizes tuned for CPU cache efficiency, and public domain licensing to promote broad reuse.[2][10] Such attributes suit memory-sensitive applications, including real-time systems and large-scale data handling.

Optimizations and Recent Developments

Since 2020, several optimizations have enhanced the cache locality and performance of block sort implementations, particularly through forks and adaptations of WikiSort. Additionally, a 2023 OpenMP-based implementation, MPDMSort, optimizes block partitioning using multi-deque structures to reduce compare-and-swap operations, achieving a 13.81× overall speedup on 16-core CPUs for block sizes around 1024 elements.[14] These developments incorporate modern hardware features, as evidenced by active contributions in open-source repositories and peer-reviewed benchmarks from 2023.

Analysis

Time and Space Complexity

Block sort exhibits a time complexity of O(n log n) in both the worst and average cases, stemming from an initial insertion sort over small blocks requiring O(n) time, followed by log n iterative merge phases in which the rolls, drops, and local merges together take linear time per phase. The core of this efficiency is the stable in-place merging procedure: combining two sorted sequences of lengths m and n = km (with m ≤ n) incurs O(m log(k + 1)) comparisons and O(m(k + 1)) assignments, so balanced merges at each phase cost O(n), while unbalanced cases are managed to avoid excess overhead across phases. The total time sums to O(n log n), as each element participates in O(log n) merges, with the doubling progression of block sizes contributing the logarithmic factor without inflating constants beyond practical bounds. Practical implementations choose initial block sizes around 2^k with k ≈ (log n)/2 to balance the initial sort and merge costs, leading to higher constants than out-of-place merge sort.[4]

In the best case, such as when the input is presorted or contains long natural runs, block sort's run detection skips redundant merge phases, yielding O(n) time dominated by the initial scan and block sorting. The choice of block size b influences practical constants: larger b reduces the number of phases but increases the initial insertion sort cost, and adaptivity further reduces work on partially sorted data. Each merge phase costs O(n) for the roll and drop operations that rearrange blocks, plus O(n log b) for the local merges within the current block size, but the progression of b across phases keeps the cumulative local merge cost within the overall O(n log n) bound.[4]

Regarding space complexity, block sort operates in place with O(1) auxiliary space beyond the input array, carving its temporary buffers for tagging and merging out of the array itself rather than allocating externally. During local merges, a temporary buffer of size O(√b) (where b is the current block size) suffices for stable in-place operation; because this buffer is extracted from the input array, the O(1) auxiliary bound holds even though up to O(√n) positions within the array are used temporarily during the final phases. The block size b also tunes space usage, with fixed small b keeping buffers minimal in internal variants.[4]
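The phase-cost argument above can be summarized in a single summation, with b the initial block size and the number of phases given by how many doublings it takes for b to reach n:

```latex
T(n) \;=\; \underbrace{O(n)}_{\text{initial insertion sorts}}
\;+\; \sum_{i=1}^{\lceil \log_2 (n/b) \rceil}
\underbrace{O(n)}_{\text{phase } i \text{: rolls, drops, local merges}}
\;=\; O(n \log n).
```

With b fixed at a small constant such as 16, the summation has Θ(log n) terms of linear cost each, which is exactly the bound stated above.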

Stability and Adaptivity

Block sort is a fully stable sorting algorithm: it preserves the relative order of elements with equal keys throughout the sorting process. Stability follows from the combination of the initial block division and the stable local merging techniques used in the subsequent phases. Local merges, such as those based on the Hwang-Lin algorithm, preserve equality by advancing pointers through the input runs sequentially, never introducing inversions among equal elements.[4][15]

A proof sketch for stability relies on the invariant that positional information is maintained across block creation and drops, where a drop redistributes elements without altering their relative order among equals. Since drops introduce no inversions (elements are simply rotated or shifted based on binary search positions), the original order is upheld. Local merges are stable by construction: they process elements as in standard merge sort but adapted for in-place operation using block rotations, so when equal keys meet, the element from the earlier run is placed first. Together these guarantee overall stability without additional space overhead.[4]

Block sort exhibits adaptivity by detecting and skipping already-sorted runs or blocks, which allows it to achieve O(n) time on nearly sorted inputs. Before merging two subarrays A and B, the algorithm performs pre-merge checks, such as monotonicity tests, to verify whether the combined sequence is already in order; if so, the merge is skipped entirely. This makes block sort more adaptive than plain merge sort, which always performs full merges regardless of input patterns, but less adaptive than TimSort, which explicitly identifies and merges natural runs of varying lengths to exploit presorted data more aggressively.[4]
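The pre-merge monotonicity test is simple enough to state directly; a sketch with a hypothetical helper name:

```python
def can_skip_merge(a_run, b_run):
    """Adaptivity check before merging adjacent sorted runs A and B:
    if A's last element is <= B's first, their concatenation is already
    sorted and the entire merge (buffers, tags, rolls) can be skipped.
    An empty run is trivially skippable."""
    return not a_run or not b_run or a_run[-1] <= b_run[0]
```

On an already-sorted array every such check succeeds, so each phase degenerates to a linear scan, which is where the O(n) best case comes from.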

Performance Characteristics

Advantages

Block sort achieves notable memory efficiency via its in-place design, utilizing only O(1) extra space for auxiliary operations such as a small buffer for local merges, without requiring additional arrays proportional to the input size. This characteristic renders it ideal for resource-constrained environments like embedded systems or processing large datasets in memory-limited settings where dynamic allocation is impractical or prohibited.[16] The algorithm delivers predictable performance through a bottom-up merging strategy that employs fixed block sizes, circumventing the recursion depth limitations and stack overflow risks inherent in top-down recursive sorts. It maintains stability by preserving the relative order of equal elements during merges and ensuring consistent O(n log n) worst-case time complexity without pathological degradation.[16] Block sort enhances cache performance on contemporary hardware by performing localized block operations and merges, which promote spatial and temporal locality, thereby minimizing cache misses. Its versatile structure facilitates extensions to external memory scenarios, where blocks are processed sequentially to fit within available RAM, and enables parallelization by independently sorting disjoint blocks before merging, making it adaptable to distributed or multi-core systems without proprietary restrictions.[16]

Disadvantages and Limitations

Implementing block sort involves intricate logic for tagging blocks, performing rotations to simulate temporary storage, and conducting in-place merges, making it significantly more prone to bugs and harder to verify than simpler algorithms like quicksort. This complexity stems from managing block identification and stable merging without extra space, which requires careful handling of edge cases in rotations and pointer manipulation.

Block sort also incurs higher constant factors from the overhead of initial block division and tagging, which can make it less efficient for small arrays where setup costs dominate over sorting benefits. Unlike adaptive algorithms such as TimSort, it does not detect or exploit natural runs in the data, leading to unnecessary work on partially sorted inputs and reduced performance in real-world scenarios with existing order. Its efficiency is also sensitive to the chosen block size: suboptimal choices cause excessive overhead from too many small blocks or memory inefficiency from oversized ones.[17]

The in-place nature of block sort complicates parallelization, since the sequential dependencies in local merges and rotations hinder straightforward threading and often require complex synchronization to avoid race conditions during data movement. Shifting operations in in-place merges suffer from poor data locality and irregular memory access, exacerbating cache misses in multi-threaded environments, while increasing the number of threads introduces overhead that diminishes returns for smaller inputs. Without specialized variants, local merges are also not easily vectorized, limiting exploitation of SIMD instructions.[18]

On highly random data, block sort runs slower than introsort because its fixed merge structure lacks the pivot-based partitioning that gives introsort better average-case constants. Although it provides worst-case guarantees, this has led some modern libraries to favor hybrid approaches over pure block sort implementations for broader applicability. Variants like WikiSort have nonetheless been optimized for modern hardware as of 2025, incorporating cache-friendly block sizes to improve practical performance in resource-constrained applications.