This site is a static rendering of the Trac instance that was used by R7RS-WG1 for its work on R7RS-small (PDF), which was ratified in 2013. For more information, see Home.

Source for wiki SortingShivers version 3

author

cowan

comment


    

ipnr

127.11.51.1

name

SortingShivers

readonly

0

text

== Title ==

Sorting library

== Author ==

Olin Shivers (edited by John Cowan)

== Issues ==

1. Maybe `vector-delete-neighbor-dups` should take a "target" vector to write?

== Table of contents ==

* Abstract
* Procedure index
* Editor's preface
* Introduction
* Design rules
  * What vs. how
  * Consistency across function signatures
  * Data parameter first, less-than parameter after
  * Ordering, comparison functions and stability
  * All vector operations accept optional subrange parameters
  * Required vs. allowed side-effects
* Specification
  * Procedure naming and functionality
  * Types of parameters and return values
  * Basic procedures
  * Algorithm-specific sorting procedures
  * Algorithmic properties
* Topics to be resolved during discussion phase
* Porting and optimisation
* Acknowledgements
* Copyright


== Abstract ==

I've designed the API for a full-featured sort toolkit, which I propose as a SRFI.
The spec comes with 1200 lines of high-quality reference code: tightly
written, highly commented, portable code, available for free.

== Procedure index ==

{{{
list-sorted?			vector-sorted?

list-merge			vector-merge
list-sort			vector-sort
list-stable-sort		vector-stable-sort

list-merge!			vector-merge!
list-sort!			vector-sort!
list-stable-sort!		vector-stable-sort!
list-delete-neighbor-dups!	vector-delete-neighbor-dups!

list-delete-neighbor-dups	vector-delete-neighbor-dups

list-merge-sort   vector-merge-sort    heap-sort   quick-sort   insert-sort   
list-merge-sort!  vector-merge-sort!   heap-sort!  quick-sort!  insert-sort!  
}}}

== Editor's preface ==

What happened to [http://srfi.schemers.org/srfi-32/srfi-32.txt SRFI 32]?
It's been twelve years since it was withdrawn.
R7RS-large needs a sorting package, and this is a substantial one.  I've made
as few adjustments as practical in order to bring the specification into 2014.
(In this section, "I" means John Cowan; elsewhere, "I" means Olin Shivers.)

There are two main functional changes.  The most important one is that the
comparison predicate now precedes the data rather than following it.
Most of the SRFI 32 commenters thought that this is the better order, despite
the widespread precedent, and I've made the change.  The other point is that
the procedures accept
[http://srfi.schemers.org/srfi-114/srfi-114.html SRFI 114] comparators
as well as predicates.

Editorially, I have reorganized the text and removed the massive redundancies.

== Introduction ==

I have designed and 
written a fairly comprehensive sorting and merging toolkit. It is
very portable and freely reusable.

The package includes:

* Vector insert sort (stable)
* Vector heap sort
* Vector quick sort (with median-of-3 pivot picking)
* Vector merge sort (stable)
* Pure and destructive list merge sort (stable)
* Stable vector and list merge
* Miscellaneous sort-related procedures: Vector and list merging, `*-sorted?` predicates, vector and list delete-equal-neighbor procedures.
* A general, non-algorithmic set of procedure names for general sorting and merging.

The code is tightly bummed. It is clearly written, and commented in my usual
voluminous style. This includes notes on porting and implementation-specific
optimisations.


== Design rules ==

=== What vs. how ===

There are two different interfaces: "what" (simple) and "how" (detailed).

* Simple: you specify semantics: datatype (list or vector), mutability, and stability.

* Detailed: you specify the actual algorithm (quick, heap, insert, merge). Different algorithms have different properties, both semantic and pragmatic, so these exports are necessary.

It is necessarily the case that the specifications of these procedures
make statements about execution "pragmatics." For example, the sole
distinction between heap sort and quick sort — both of which are
provided by this library — is one of execution time, which is not a
"semantic" distinction. Similar resource-use statements are made about
"iterative" procedures, meaning that they can execute on input of
arbitrary size without needing to allocate an unbounded number of stack
frames.

=== Consistency across function signatures ===

Procedures share common function signatures wherever
possible, to facilitate switching a given call from one procedure
to another.
	
=== Comparison parameter before data parameter ===

These procedures uniformly observe the following parameter order:
the data to be sorted come after the comparison function.
In SRFI 32, the reverse convention was used, because it was
consistent with all the implementations examined in 2002, with
the sole exception of Chez Scheme.  However,
[http://www.r6rs.org/final/html/r6rs-lib/r6rs-lib-Z-H-5.html#node_chap_4 R6RS]
adopted the reverse
convention, which is more consistent with other Scheme libraries that
put the ordering function first — the "operation currying" convention,
and this SRFI uses the R6RS order.
(E.g., consider `for-each` or `map` or `find`.)

=== Ordering, comparison functions and stability ===

These routines take a < comparison function, not a <= comparison
function, and they sort into increasing order. The difference between
a < spec and a <= spec comes up in two places: 

* the definition of an ordered or sorted data set, and 
* the definition of a stable sorting algorithm.

We say that a data set (a list or vector) is ''sorted'' or ''ordered''
if it contains no adjacent pair of values ... X Y ... such that Y < X. 

In other words, scanning across the data never takes a "downwards" step.

If you use a <= procedure where these algorithms expect a <
procedure, you may not get the answers you expect. For example,
the `list-sorted?` function will return false if you pass it a <= comparison
function and an ordered list containing adjacent equal elements.

A "stable" sort is one that preserves the pre-existing order of equal
elements. Suppose, for example, that we sort a list of numbers by 
comparing their absolute values, i.e., using comparison function
`(lambda (x y) (< (abs x) (abs y)))`.
If we sort a list that contains both 3 and -3,
then a stable sort is an algorithm that will not swap the order
of these two elements, that is, the answer will look like `... 3 -3 ...`,
not `... -3 3 ...`.

Choosing < for the comparison function instead of <= affects how stability
is coded. Given an adjacent pair X Y, (< y x) means "Y should be moved in
front of X" — otherwise, leave things as they are. So using a <= function
where a < function is expected will ''invert'' stability.

This is due to the definition of equality, given a < comparator:
`(and (not (< x y)) (not (< y x)))`.
The definition is rather different, given a <= comparator:
`(and (<= x y) (<= x y))`.

A "stable" merge is one that reliably favors one of its data sets
when equal items appear in both data sets.
''All merge operations in this library are stable'',
breaking ties between data sets in favor
of the first data set — elements of the first list come before equal 
elements in the second list.

So, if we are merging two lists of numbers ordered by absolute value
using the stable merge operation `list-merge`, then

{{{
(list-merge (lambda (x y) (< (abs x) (abs y)))
            '(0 -2 4 8 -10) '(-1 3 -4 7))
}}}

reliably places the 4 of the first list before the equal-comparing -4
of the second list: `(0 -1 -2 4 -4 7 8 -10)`.

In short, if your comparison function ''f'' answers true to `(`''f x x''`)`, then 
using a stable sorting or merging algorithm will not give you a stable sort
or merge, and `list-sorted?` may surprise you. Note that you can synthesize a <
function from a <= function with `(lambda (x y) (not (<= y x)))`
if need be. 

Precise definitions give sharp edges to tools, but require care
in use. "Measure twice, cut once."

I have adopted the choice of < from Common Lisp. I assume they
had a good reason for adopting < instead of <=. I'd love to know
what this reason is; send me email if you can explain it, please.

This version of the specification also accepts SRFI 114 comparators
in place of procedures, which buys some reliability and possibly some
efficiency.

=== All vector operations accept optional subrange parameters ===

The vector operations specified below all take optional ''start'' and ''end''
arguments indicating a selected subrange of a vector's elements.

=== Required vs. allowed side-effects ===

`list-sort!` and `list-stable-sort!` are allowed, but not required,
to alter their arguments' cons cells to construct the result list. This is
consistent with the what-not-how character of the group of procedures
to which they belong (the "sort-lib" package).

The `list-delete-neighbor-dups!`, `list-merge!` and `list-merge-sort!` procedures,
on the other hand, provide specific algorithms, and, as such, explicitly
commit to the use of side-effects on their input lists in order to guarantee
their key algorithmic properties (e.g., linear-time operation, constant-space
stack use).

== Specification ==

=== Procedure naming and functionality ===

Almost all of the procedures described below are variants of two basic
operations: sorting and merging. These procedures are consistently named
by composing a set of basic lexemes to indicate what they do.


||Lexeme||Meaning
||`sort`||The procedure sorts its input data set by some < comparison function.||
||`merge`||The procedure merges two ordered data sets into a single ordered result.||
||`stable`||This lexeme indicates that the sort is a stable one.||
||`vector`||The procedure operates upon vectors.||
||`list`||The procedure operates upon lists.||
||`!`||Procedures that end in "!" are allowed, and sometimes required, to reuse their input storage to construct their answer.||


=== Types of parameters and return values ===

In the procedures specified below,

* A ''lis'' parameter is a list;

* A ''v'' parameter is a vector;

* A < or = parameter is a procedure accepting two arguments taken from the specified procedure's data set(s), and returning a boolean;

* A ''start'' parameter or ''start'' and ''end'' parameter pair is given to such a procedure, they must be exact, non-negative integers, such that 0 <= ''start'' <= ''end'' <= `(vector-length `''v''`)`, where ''v'' is the related vector parameter. If not specified, they default to 0 and the length of the vector, respectively. They are interpreted to select the range [''start'', ''end''), that is, all elements from index ''start'' (inclusive) up to, but not including, index ''end''.

Passing values to procedures with these parameters that do not satisfy these
types is an error.

If a procedure is said to return "an unspecified value", this means that nothing at all
is said about what the procedure returns, except that it returns one value.

=== Predicates ===

`(list-sorted? `''< lis''`)`

`(vector-sorted? `'' < v'' [''start'' [ ''end'' ] ]

These procedures return true if their input list or vector
is in sorted order, as determined by their < comparison parameter.
Specifically, they return `#f` iff there is an adjacent pair ... X Y ... in the input
list or vector such that Y < X. The optional ''start'' and ''end'' range 
arguments restrict `vector-sorted?` to examining the indicated subvector.

=== General sort procedures ===

These procedures provide basic sorting and merging functionality suitable for
general programming. The procedures are named by their semantic properties,
i.e., what they do to the data (sort, stable sort, and so forth).

`(list-sort ` ''< lis''`)`

`(list-stable-sort ` ''< lis''`)`

These procedures
do not alter their inputs and are allowed to return a value that shares 
a common tail with a list argument.

The `list-stable-sort` procedure is equivalent to the R6RS `list-sort` procedure.

`(list-sort! ` ''< lis''`)`

`(list-stable-sort! ` ''< lis''`)`

These procedures
are "linear update" operators — they are allowed, but not required, to
alter the cons cells of their arguments to produce their results.

`(vector-sort ` ''< v'' [ ''start'' [ ''end'' ] ]`)`

`(vector-stable-sort ` ''< v'' [ ''start'' [ ''end'' ] ]`)`

These procedures
do not alter their inputs, but allocate a fresh vector for their result,
of length ''end'' - ''start''.
 
The `vector-stable-sort` procedure with no optional arguments
is equivalent to the R6RS `vector-sort` procedure.

`(vector-sort! ` ''< v'' [ ''start'' [ ''end'' ] ]`)`

`(vector-stable-sort! ` ''< v'' [ ''start'' [ ''end'' ] ]`)`

These procedures
sort their data in-place. (But note that `vector-stable-sort!` may 
allocate temporary storage proportional to the size of the input —
I am not aware of O(n lg n) stable vector sorting algorithms that
run in constant space.)

The `vector-sort!` procedure with no optional arguments is equivalent
to the R6RS `vector-sort!` procedure.

=== Merge procedures ===

All four merge operations are stable: an element of the initial list ''lis1''
or vector ''v1'' will come before an equal-comparing element in the second
list ''lis2'' or vector ''v2'' in the result.

`(list-merge `''< lis1 lis2''`)`

This procedure
does not alter its inputs, and is allowed to return a value that shares 
a common tail with a list argument.

`(list-merge! `''< lis1 lis2''`)`

This procedure
makes only a single, iterative, linear-time pass over its argument lists,
using `set-cdr!`s to rearrange the cells of the lists into the final result
— they work "in place." Hence, any cons cell appearing in the result must
have originally appeared in an input.

Additionally,
`list-merge!` is iterative, not recursive — it can operate on arguments of
arbitrary size without requiring an unbounded amount of stack space.
The intent of this
iterative-algorithm commitment is to allow the programmer to be sure that
if, for example, `list-merge!` is asked to merge two ten-million-element
lists, the operation will complete without performing some extremely
(possibly twenty-million) deep recursion.

`(vector-merge` ''< v1 v2'' [ ''start1'' [ ''end1'' [ ''start2'' [ ''end2'' ] ] ] ]`)`

This procedure does not alter its inputs,
and returns a newly allocated vector
of length (''end1'' - ''start1'') + (''end2'' - ''start2''`)`.

`(vector-merge!` ''< v0 v1 v2'' [ ''start0'' [ ''start1'' [ ''end1'' [ ''start2'' [ ''end2'' ] ] ] ] ]`)`

Writes its result into vector ''v0'', beginning at index ''start0'',
for indices less than
''end0''  = ''start0'' + (''end1'' - ''start1'') + (''end2'' - ''start2''`)`.
The target subvector ''v0''[''start0'', ''end0'') may not overlap either source subvector,
''v1''[''start1'', ''end1'') or ''v2''[''start2'', ''end2'').

== Deleting duplicate neighbors ==

These procedures delete adjacent duplicate elements from a list or a
vector, using a given element-equality procedure. The first/leftmost
element of a run of equal elements is the one that survives. The list or
vector is not otherwise disordered.

These procedures are linear time — much faster than the O(n^2^) general
duplicate-element deletion procedures that do not assume any "bunching" of elements
(such as the ones provided by [http://srfi.schemers.org/srfi-1/srfi-1.html SRFI 1]).
If you want to delete duplicate
elements from a large list or vector, sort the elements to bring equal
items together, then use one of these procedures, for a total time of
O(n lg n).

The comparison function = passed to these procedures is always applied
as `(= x y)`,
where X comes before Y in the containing list or vector.

`(list-delete-neighbor-dups! ` ''= lis''`)`

This procedure mutates its input list in order to construct its answer.
It makes only a single, iterative, linear-time pass over their
argument lists, using `set-cdr!`s to rearrange the cells of the lists
into the final result — they work "in place." Hence, any cons cell appearing
in the result must have originally appeared in an input.

`(vector-delete-neighbor-dups! ` ''= lis''`)`

This procedure reuses its input vector to hold the answer, packing its answer into the index range [start,newend), 

where `newend` is the non-negative exact integer returned as its value. It returns `newend` as its result. The vector 

is not altered outside the range [''start, newend'').

`(list-delete-neighbor-dups `''= v''`)`

This procedure does not alter its input list; its answer may share
storage with the input list.

`(vector-delete-neighbor-dups ` ''= v''`)`

This procedure does not alter its input vector, but rather
allocates and returns a fresh vector to hold the result.

Examples:

{{{
	(list-delete-neighbor-dups = '(1 1 2 7 7 7 0 -2 -2))
               => (1 2 7 0 -2)

	(vector-delete-neighbor-dups = '#(1 1 2 7 7 7 0 -2 -2))
               => #(1 2 7 0 -2)

	(vector-delete-neighbor-dups = '#(1 1 2 7 7 7 0 -2 -2) = 3 7))
               => #(7 0 -2)

;; Result left in v[3,9):
(let ((v (vector 0 0 0 1 1 2 2 3 3 4 4 5 5 6 6)))
  (cons (vector-delete-neighbor-dups! = v 3)
        v))
              => (9 . #(0 0 0 1 2 3 4 5 6 4 4 5 5 6 6))
}}}

=== Algorithm-specific sorting procedures ===

These packages provide more specific sorting functionality, that is,
specific commitment to particular algorithms that have particular
pragmatic consequences (such as memory locality, asymptotic running time)
beyond their semantic behaviour (sorting, stable sorting, merging, etc.).
Programmers that need a particular algorithm can use one of these procedures.

`(list-merge-sort `''< lis''`)`

`(list-merge-sort! ''< lis''`)`

These procedures sort their data using a list merge sort, which is
stable. (The sample implementation is, additionally, a "natural" sort.
See below for the properties of this algorithm.)
The `list-merge-sort` procedure returns a newly allocated list.
The `list-merge-sort!` procedure is destructive — it uses `set-cdr!`s to rearrange the
cells of the lists into the proper order. As such, it does not allocate
any extra cons cells — it is an "in place" sort.  It returns the sorted list.

`(vector-merge-sort `''< v'' [ ''start'' [ ''end'' [ ''temp'' ] ] ]`)`

`(vector-merge-sort! `''< v'' [ ''start'' [ ''end'' [ ''temp'' ] ] ]`)`

These procedures sort their data using vector merge sort, which is
stable. (The sample implementation is, additionally, a "natural" sort.
See below for the properties of this algorithm.)
The `vector-merge-sort` procedure returns a newly allocated vector of length `end-start`.
The `vector-merge-sort!` procedure leaves its result in ''v''[''start'', ''end'').

`vector-merge` returns a vector of length (''end1'' - ''start1'') + (''end2'' - ''start2'').

`(heap-sort ` ''< v'' [ ''start'' [ ''end'' ] ]`)`

`(heap-sort! ` ''< v'' [ ''start'' [ ''end'' ] ]`)`

These procedures sort their data using heap sort, 
which is not a stable sorting algorithm. The
`heap-sort` procedure returns a vector of length `end-start`. 
The `heap-sort!` procedure is in-place, leaving its result in ''v''[''start'', ''end'').

`(quick-sort `''< v'' [ ''start'' [ ''end'' ] ]`)`

`(quick-sort! `''< v'' [ ''start'' [ ''end'' ] ]`)`

These procedures sort their data using quick sort, 
which is not a stable sorting algorithm.
The `quick-sort` procedure returns a vector of length `end-start`. 
The `quick-sort!` procedure is in-place, leaving its result in ''v''[''start'', ''end'').

`(insert-sort `''< v'' [ ''start'' [ ''end'' ] ]`)`

(`insert-sort! `''< v'' [ ''start'' [ ''end'' ] ]`)`

These procedures stably sort their data using insertion sort.

`insert-sort` returns a vector of length `end-start`.
`insert-sort!` is in-place, leaving its result in ''v''[''start'', ''end'').

=== Algorithmic properties ===

Different sort and merge algorithms have different properties.
Choose the algorithm that matches your needs:

 Vector insert sort::
  Stable, but only suitable for small vectors — O(n^2).

 Vector quick sort::
  Not stable. Is fast on average — O(n lg n) — but has bad worst-case
  behaviour. Has good memory locality for big vectors (unlike heap sort). 
  A clever pivot-picking trick (median of three samples) helps avoid 
  worst-case behaviour, but pathological cases can still blow up.

 Vector heap sort::
  Not stable. Guaranteed fast — O(n lg n) ''worst'' case. Poor locality
  on large vectors. A very reliable workhorse.

 Vector merge sort::
  Stable. Not in-place — requires a temporary buffer of equal size. 
  Fast — O(n lg n) — and has good memory locality for large vectors.
  The implementation of vector merge sort provided by this SRFI's sample
  implementation is, additionally, a "natural" sort, meaning that it
  exploits existing order in the input data, providing O(n) best case.

 Destructive list merge sort::
  Stable, fast and in-place (i.e., allocates no new cons cells). "Fast"
  means O(n lg n) worst-case, and substantially better if the data
  is already mostly ordered, all the way down to linear time for
  a completely-ordered input list (i.e., it is a "natural" sort).

 Pure list merge sort::
  Stable and fast — O(n lg n) worst-case, and possibly better, depending
  upon the input list (see above).

Note that sorting lists involves chasing pointers through memory, which
can be a loser on modern machine architectures because of poor cache and
page locality. Pointer ''writing'', which is what the `set-cdr!`s of a
destructive list-sort algorithm do, is even worse, especially if your
Scheme has a generational `gc` — the writes will thrash the write-barrier.
Sorting vectors has inherently better locality.

This SRFI's destructive list merge and merge sort implementations are
opportunistic — they avoid redundant `set-cdr!`s, and try to take long
already-ordered runs of list structure as-is when doing the merges.


||Algorithm||Stable?||Worst case||Average case||In-place||
||V insert||Yes||O(n^2)||O(n^2)||Yes||
||V quick||No||O(n^2)||O(n lg n)||Yes||
||V heap||No||O(n lg n)||O(n lg n)||Yes||
||V merge||Yes||O(n lg n)||O(n lg n)||No||
||L merge||Yes||O(n lg n)||O(n lg n)||Either||

Note that there are no list insert sort procedures, as you might as well always
use list merge sort. The sample implementation's destructive list merge
sort will do fewer SET-CDR!s than a destructive insert sort.

== Porting and optimisation ==

This package should be trivial to port. There are only four non-R4RS bits
in the code:

* Use of multiple-value return, with the R5RS `values` procedure, and the simple [http://srfi.schemers.org/srfi-8/srfi-8.html SRFI 8] `(receive  (`''var'' ...`)` ''mv-exp body'' ... `)` multiple-value binding macro.

* A `vector-copy` procedure. This is a tiny little procedure: `(vector-copy `''v [ ''start'' [ ''end'' ] ]`)`.  It is part of R7RS-small.

* The `let-optionals` macro from scsh to parse and default optional arguments to three routines. Again, easy to port the macro or rewrite the code to parse, default, and error check the args by hand. 

* An `error` function for complaining about bad arguments.  It is part of R7RS-small.

This code is tightly bummed, as far as I can go in portable Scheme.

You could speed up the vector code a lot by error-checking the procedure
parameters and then shifting over to fixnum-specific arithmetic and
dangerous vector-indexing and vector-setting primitives. The comments
in the code indicate where the initial error checks would have to be
added. There are several `(quotient` ''n'' 2)` calls that could be changed to a
fixnum right-shift, as well, in both the list and vector code. The code
is designed to enable this — each file usually exports one or two "safe"
procedures that end up calling an internal "dangerous" primitive. The
little exported cover procedures are where you move the error checks.

This should provide ''big'' speedups. In fact, all the code bumming I've done
pretty much disappears in the noise unless you have a good compiler and also
can dump the vector-index checks and generic arithmetic — so I've really just
set things up for you to exploit.

The optional-arg parsing, defaulting, and error checking is done with a
portable R4RS macro. But if your Scheme has a faster mechanism (e.g., Chez),
you should definitely port over to it. Note that argument defaulting and
error-checking are interleaved — you don't have to error-check defaulted
''start'' and ''end'' args to see if they are fixnums that are legal vector indices for
the corresponding vector, etc.

== Acknowledgements ==

I thank the authors of the open source I consulted when designing this
library, particularly Richard O'Keefe, Donovan Kolby and the MIT Scheme Team.

== Copyright ==

=== SRFI text ===

This document is copyright (C) Olin Shivers (1998, 1999). 
All Rights Reserved. 

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

=== Sample implementation ===

Short summary: no restrictions.

While I wrote all of this code myself, I read a lot of code before I began
writing. However, all such code is, itself, either open source or public
domain, rendering irrelevant any issue of "copyright taint."

The natural merge sorts (pure list, destructive list, and vector) are not only
my own code, but are implementations of an algorithm of my own devising. They
run in O(n lg n) worst case, O(n) best case, and require only a logarithmic
number of stack frames. And they are stable. And the destructive-list variant
allocates zero cons cells; it simply rearranges the cells of the input list.

Hence the sample implementation is Copyright © 1998 by Olin Shivers
and made available under the same copyright as the SRFI text (see above).

time

2014-08-06 09:44:53

version

3