Recent from talks
Knowledge base stats:
Talk channels stats:
Members stats:
Polyphase merge sort
A polyphase merge sort is a variation of a bottom-up merge sort that sorts a list using an initial uneven distribution of sub-lists (runs), primarily used for external sorting, and is more efficient than an ordinary merge sort when there are fewer than eight external working files (such as a tape drive or a file on a hard drive). A polyphase merge sort is not a stable sort.
A merge sort splits the records of a dataset into sorted runs of records and then repeatedly merges sorted runs into larger sorted runs until only one run, the sorted dataset, remains.
A ‘balanced’ merge sort using four working files organizes them as a pair of input files and a pair of output files. The dataset is distributed evenly between two of the working files, either as sorted runs or in the simplest case, single records, which can be considered to be sorted runs of size 1. Once all of the dataset is transferred to the two working files, those two working files become the input files for the first merge iteration. Each merge iteration merges runs from the two input working files, alternating the merged output between the two output files, again distributing the merged runs evenly between the two output files (until the final merge iteration). Once all of the runs from the two inputs files are merged and output, then the output files become the input files and vice versa for the next merge iteration. The number of runs decreases by a factor of 2 at each iteration, such as 64, 32, 16, 8, 4, 2, 1. For the final merge iteration, the two input files only have one sorted run (1/2 of the dataset) each, and the merged result is a single sorted run (the sorted dataset) on one of the output files. This is also described at Merge sort § Use with tape drives.
If there are only three working files, then a balanced merge sort merges sorted runs from two working files onto a single working file, then distributes the runs evenly between the two output files. The merge iteration reduces run count by a factor of 2, the redistribute iteration doesn't reduce run count (the factor is 1). Each iteration could be considered to reduce the run count by an average factor of √2 ≈ 1.41. If there are 5 working files, then the pattern alternates between a 3 way merge and a 2 way merge, for an average factor of √6 ≈ 2.45.
In general, for an even number N of working files, each iteration of a balanced merge sort reduces run count by a factor of N/2, while for an odd number N of working files, each iteration reduces the run count by an average factor of √(N2−1)/4 = √N2−1/2.
For N < 8 working files, a polyphase merge sort achieves a higher effective run count reduction factor by unevenly distributing sorted runs between N−1 working files (explained in next section). Each iteration merges runs from N−1 working files onto a single output working file. When the end of one of the N−1 working files is reached, then it becomes the new output file and what was the output file becomes one of the N−1 working input files, starting a new iteration of polyphase merge sort. Each iteration merges only a fraction of the dataset (about 1/2 to 3/4), except for the last iteration which merges all of the dataset into a single sorted run. The initial distribution is set up so that only one input working file is emptied at a time, except for the final merge iteration which merges N−1 single runs (of varying size, this is explained next) from the N−1 input working files to the single output file, resulting in a single sorted run, the sorted dataset.
For each polyphase iteration, the total number of runs follows a pattern similar to a reversed Fibonacci numbers of higher order sequence. With 4 files, and a dataset consisting of 57 runs, the total run count on each iteration would be 57, 31, 17, 9, 5, 3, 1. Note that except for the last iteration, the run count reduction factor is a bit less than 2, 57/31, 31/17, 17/9, 9/5, 5/3, 3/1, about 1.84 for a 4 file case, but each iteration except the last reduced the run count while processing about 65% of the dataset, so the run count reduction factor per dataset processed during the intermediate iterations is about 1.84 / 0.65 = 2.83. For a dataset consisting of 57 runs of 1 record each, then after the initial distribution, polyphase merge sort moves 232 records during the 6 iterations it takes to sort the dataset, for an overall reduction factor of 2.70 (this is explained in more detail later).
After the first polyphase iteration, what was the output file now contains the results of merging N−1 original runs, but the remaining N−2 input working files still contain the remaining original runs, so the second merge iteration produces runs of size (N−1) + (N−2) = (2N − 3) original runs. The third iteration produces runs of size (4N − 7) original runs. With 4 files, the first iteration creates runs of size 3 original runs, the second iteration 5 original runs, the third iteration 9 original runs and so on, following the Fibonacci like pattern, 1, 3, 5, 9, 17, 31, 57, ... , so the increase in run size follows the same pattern as the decrease in run count in reverse. In the example case of 4 files and 57 runs of 1 record each, the last iteration merges 3 runs of size 31, 17, 9, resulting in a single sorted run of size 31+17+9 = 57 records, the sorted dataset. An example of the run counts and run sizes for 4 files, 31 records can be found in table 4.3 of.
Hub AI
Polyphase merge sort AI simulator
(@Polyphase merge sort_simulator)
Polyphase merge sort
A polyphase merge sort is a variation of a bottom-up merge sort that sorts a list using an initial uneven distribution of sub-lists (runs), primarily used for external sorting, and is more efficient than an ordinary merge sort when there are fewer than eight external working files (such as a tape drive or a file on a hard drive). A polyphase merge sort is not a stable sort.
A merge sort splits the records of a dataset into sorted runs of records and then repeatedly merges sorted runs into larger sorted runs until only one run, the sorted dataset, remains.
A ‘balanced’ merge sort using four working files organizes them as a pair of input files and a pair of output files. The dataset is distributed evenly between two of the working files, either as sorted runs or in the simplest case, single records, which can be considered to be sorted runs of size 1. Once all of the dataset is transferred to the two working files, those two working files become the input files for the first merge iteration. Each merge iteration merges runs from the two input working files, alternating the merged output between the two output files, again distributing the merged runs evenly between the two output files (until the final merge iteration). Once all of the runs from the two inputs files are merged and output, then the output files become the input files and vice versa for the next merge iteration. The number of runs decreases by a factor of 2 at each iteration, such as 64, 32, 16, 8, 4, 2, 1. For the final merge iteration, the two input files only have one sorted run (1/2 of the dataset) each, and the merged result is a single sorted run (the sorted dataset) on one of the output files. This is also described at Merge sort § Use with tape drives.
If there are only three working files, then a balanced merge sort merges sorted runs from two working files onto a single working file, then distributes the runs evenly between the two output files. The merge iteration reduces run count by a factor of 2, the redistribute iteration doesn't reduce run count (the factor is 1). Each iteration could be considered to reduce the run count by an average factor of √2 ≈ 1.41. If there are 5 working files, then the pattern alternates between a 3 way merge and a 2 way merge, for an average factor of √6 ≈ 2.45.
In general, for an even number N of working files, each iteration of a balanced merge sort reduces run count by a factor of N/2, while for an odd number N of working files, each iteration reduces the run count by an average factor of √(N2−1)/4 = √N2−1/2.
For N < 8 working files, a polyphase merge sort achieves a higher effective run count reduction factor by unevenly distributing sorted runs between N−1 working files (explained in next section). Each iteration merges runs from N−1 working files onto a single output working file. When the end of one of the N−1 working files is reached, then it becomes the new output file and what was the output file becomes one of the N−1 working input files, starting a new iteration of polyphase merge sort. Each iteration merges only a fraction of the dataset (about 1/2 to 3/4), except for the last iteration which merges all of the dataset into a single sorted run. The initial distribution is set up so that only one input working file is emptied at a time, except for the final merge iteration which merges N−1 single runs (of varying size, this is explained next) from the N−1 input working files to the single output file, resulting in a single sorted run, the sorted dataset.
For each polyphase iteration, the total number of runs follows a pattern similar to a reversed Fibonacci numbers of higher order sequence. With 4 files, and a dataset consisting of 57 runs, the total run count on each iteration would be 57, 31, 17, 9, 5, 3, 1. Note that except for the last iteration, the run count reduction factor is a bit less than 2, 57/31, 31/17, 17/9, 9/5, 5/3, 3/1, about 1.84 for a 4 file case, but each iteration except the last reduced the run count while processing about 65% of the dataset, so the run count reduction factor per dataset processed during the intermediate iterations is about 1.84 / 0.65 = 2.83. For a dataset consisting of 57 runs of 1 record each, then after the initial distribution, polyphase merge sort moves 232 records during the 6 iterations it takes to sort the dataset, for an overall reduction factor of 2.70 (this is explained in more detail later).
After the first polyphase iteration, what was the output file now contains the results of merging N−1 original runs, but the remaining N−2 input working files still contain the remaining original runs, so the second merge iteration produces runs of size (N−1) + (N−2) = (2N − 3) original runs. The third iteration produces runs of size (4N − 7) original runs. With 4 files, the first iteration creates runs of size 3 original runs, the second iteration 5 original runs, the third iteration 9 original runs and so on, following the Fibonacci like pattern, 1, 3, 5, 9, 17, 31, 57, ... , so the increase in run size follows the same pattern as the decrease in run count in reverse. In the example case of 4 files and 57 runs of 1 record each, the last iteration merges 3 runs of size 31, 17, 9, resulting in a single sorted run of size 31+17+9 = 57 records, the sorted dataset. An example of the run counts and run sizes for 4 files, 31 records can be found in table 4.3 of.