This site is a static rendering of the Trac instance that was used by R7RS-WG1 for its work on R7RS-small (PDF), which was ratified in 2013. For more information, see Home. For a version of this page that may be more recent, see ImmutableDataStructuresWortman in WG2's repo for R7RS-large.

Immutable­Data­Structures­Wortman

cowan
2013-06-19 10:32:25
12Removing obsolete commentshistory
source

Introduction

This proposal defines immutable data structures for queues, sets, and maps. A structure is immutable when all its operations leave the structure unchanged. Note that none of the procedures specified here ends with a !.

Immutable structures are sometimes called persistent and are closely related to pure-functional (a.k.a. pure) structures. The availability of immutable data structures facilitates writing efficient programs in the pure-functional style.

Efficiency bounds

We specify required time efficiency upper bounds using big-O notation. We note when, in some cases, there is "slack" between the required bound and the theoretically optimal bound for an operation. Implementations may use data structures with amortized time bounds, but should document which bounds hold in only an amortized sense. The use of randomized data structures with expected time bounds is discouraged.

Deque

A deque (conventionally pronounced "deck") is a queue data structure that supports efficient manipulation of either of its ends. It may be used in place of a (LIFO) stack or (FIFO) queue.

The unlabeled finger tree data structure can meet all these requirements rather conveniently.

Construction

(ideque [element...])

Create a new deque containing element.... The leftmost element (if any) will be at the front of the deque and the rightmost element (if any) will be at the back. Takes O(n) time, where n is the number of elements.

Queries

(ideque? x) (ideque-empty? deque) (ideque-length deque)

ideque?}}} and {{{ideque-empty?}}} take O(1) time; {{{ideque-length may take O(n) time, though O(1) is optimal.

Queue operations

(ideque-front deque) (ideque-back deque)

Return the front/back element. It is an error for deque to be empty. Each takes O(1) time.

(ideque-pop-front deque) (ideque-pop-back deque)

Return a new deque with the front/back element removed. It is an error for deque to be empty. Each takes O(1) time.

(ideque-push-front deque x) (ideque-push-back deque x)

Return a new deque with x}}} pushed to the front/back of {{{deque. Each takes O(1) time.

Miscellaneous

(ideque-append deque...)

Return a new deque with the contents of the first argument deque followed by the others. Takes O(kn) time, where k is the number of deques and n is the number of elements involved, though O(k log n) is possible.

Conversion

(list->ideque list) (ideque->list deque)

Conversion between deque and list structures. FIFO order is preserved, so the front of a list corresponds to the front of a deque. Each operation takes O(n) time.

Set

A sorted set data structure stores a finite collection of unique orderable (a.k.a. comparable or sortable) elements.

These requirements can be satisfied by many flavors of self-balancing binary trees. Red-black trees, 1-2 brother trees, and labeled 2-3 finger trees are particularly convenient in an immutable context.

Order predicates

An order predicate for a universe of orderable elements is a procedure precedes?}}} such that, for set elements of the universe {{{x}}} and {{{y,

(precedes? x y)

returns non-#false}}} iff {{{x}}} precedes {{{y}}} in the set's ordering. It returns {{{#false}}} when {{{x}}} succeeds (comes after) {{{y or when the elements are tied for the same position.

It is essential that order predicates return #false in the case of equal elements, as implementations may test for element equality with expressions of the form

(and (not (precedes? x y)) (not (precedes? y x)))

Note that <}}}, {{{char<?}}}, and {{{string<? are valid order predicates for sets of numbers, characters, and strings, respectively.

The time bounds stated below are all based on the assumption that evaluating an order predicate takes O(1) time.

Deduplication policy

All elements of a set must be unique, which complicates operations that produce sets from collections that may contain duplicates. The procedures described consintely follow a least recently used removal policy. That means that, when an operation encounters two elements that are equal according to the order predicate, the one that was encountered earlier is removed from the set and replaced with the element that was encountered later.

Construction

(iset precedes? [element...])

Return a new set using the order predicate precedes?}}} and containing elements {{{element...}}}. If any elements are equal according to {{{precedes?, only the rightmost of them is retained. O(n log n) time.

Set-wide queries

(iset? x) (iset-empty? set) (iset-size set) (iset-order-predicate set)

iset-size may take O(n) time, though O(1) is optimal. The other procedures take O(1) time.

Set-theoretic operations

(iset-union set...) (iset-intersection set...) (iset-difference left set...) (iset-xor set1 set2)

Return a new set containing the union/intersection/difference/symmetric difference of the arguments. All the arguments must be sets sharing an equivalent order predicate. The set operator is applied in a left-associative order. When an element is a member of multiple arguments to set-union, the object from the rightmost argument is retained. May take O(kn log n) time, where k is the number of sets and n is the number of elements involved, though O(kn) time is optimal.

Element queries

(iset-member? set x)

Returns non-#false}}} iff {{{x}}} is an element of {{{set}}}, or {{{#false otherwise. Takes O(log n) time.

(iset-min set) (iset-max set)

Returns the least/greatest element of set}}}. It is an error for {{{set to be empty. Takes O(log n) time; O(1) is optimal.

Element operations

(iset-update set x present-proc absent-proc)

A continuation-based universal update procedure. Attempts to find an element match}}} equal to {{{x}}} in {{{set}}} according to the order predicate. When such a {{{match is found, calls

(present-proc match update remove)

which must either tail-call

(update new-element ret)

to replace match}}} with {{{new-element, or tail-call

(remove ret)

to remove match}}} from {{{set}}}. In either case {{{ret}}} will be one of the values returned from the enclosing {{{iset-update call.

When no such match}}} is found, {{{iset-update calls

(absent-proc insert ignore)

which should either tail-call

(insert ret)

to insert x}}} into {{{set, or else tail-call

(ignore ret)

to leave set}}} unchanged. Again, {{{ret}}} will be returned from the enclosing {{{iset-update call.

In any case, iset-update returns

(values new-set ret)

where new-set}}} is a set reflecting the indicated modification, if any; and {{{ret is the return value produced by one of the continuations. It runs in O(log n) time.

(This procedure is based on an analogous procedure for hash tables suggested by Alexey Radul and attributed to Taylor Campbell.)

(iset-find set x [absent-thunk])

Returns the element match}}} equal to {{{x}}} in {{{set}}}, or the result of evaluating {{{(absent-thunk)}}} if no such element exists. If {{{absent-thunk}}} is unspecified then a default procedure that always returns {{{#false is used. Takes O(log n) time.

(iset-include set x) (iset-exclude set x)

Return a new set that certainly does/doesn't include x}}}. {{{iset-include}}} may replace an object equal to {{{x}}} already in {{{set, if any. Each operation takes O(log n) time.

(iset-predecessor set x [absent-thunk]) (iset-successor set x [absent-thunk])

Returns the element that immediately precedes/succeeds x}}} in {{{set}}}'s ordering. If no such element exists because {{{x}}} is the minimum/maximum element, or because {{{set}}} is empty, returns the result of evaluating {{{(absent-thunk)}}}. If {{{absent-thunk}}} is unspecified then a default procedure that always returns {{{#false is used. Takes O(log n) time.

Range queries

(iset-between set least include-least most include-most)

Returns a new set containing the elements of set}}} in the interval between {{{least}}} and {{{most}}}. If {{{include-least}}}/{{{include-most}}} is non-{{{#false}}} then the result includes an element equal to {{{least}}}/{{{most respectively; otherwise those elements are not included. Takes O(k log k + log n) time, where n is the number of elements in the set and k is the number of elements returned; O(k + log n) is optimal.

(iset-range< set x) (iset-range<= set x) (iset-range= set x) (iset-range>= set x) (iset-range> set x)

Returns a new set containing only the elements of set}}} that are less/less-or-equal/equal/greater-or-equal/greater than {{{x. Takes O(k log k + log n) time, where n is the number of elements in the set and k is the number of elements returned; O(k + log n) is optimal.

Note that since set elements are unique, iset-range= returns a set of at most one element.

Filter, fold, map

(iset-filter predicate? set)

Returns a new set containing only those elements x}}} for which {{{(predicate? x)}}} returns non-{{{#false. Takes O(n log n) time; O(n) is optimal.

(iset-fold proc base set)

The fundamental set iterator. Equivalent to, but may be more efficient than,

(fold proc base (iset->ordered-list set))

Takes O(n) time.

(iset-map/monotone proc set [precedes?])

Returns a new set containing the elements (proc x)}}} for every {{{x}}} in {{{set}}}. {{{proc}}} must be a ''monotone'' unary procedure that preserves the order of set elements. Observe that mapping a set of unique elements with a monotone function yields a new set of unique elements, so element uniqueness follows from the monotonicity assumption. If {{{precedes?}}} is given, it is the order predicate of the new set; otherwise the new set uses the same order predicate as {{{set. Takes O(n) time.

(iset-map proc set [precedes?])

As iset-map/monotone}}}, except the {{{proc is not required to be monotone. When multiple elements are mapped to the same value, only the value originating from the greatest unmapped element is retained. O(n log n) time.

Conversion

(iset->ordered-list set)

Returns a list containing the elements of set in increasing order. O(n) time.

(ordered-list->iset precedes? list)

Returns a set containing the elements of list}}} and using order predicate {{{precedes?}}}. It is an error for {{{list to be anything other than a proper list of elements in increasing order. O(n log n) time; O(n) is optimal.

(list->iset precedes? list)

Returns a set containing the elements of list}}} and using order predicate {{{precedes?}}}. {{{list}}} must be a proper list, but may contain duplicates and need not be in order. When a subset of input elements are equal according to {{{precedes?, only the last occurrence is retained. O(n log n) time.

Element comparison

(iset-element<? set x y) (iset-element=? set x y) (iset-element>? set x y)

These procedures compare x}}} to {{{y}}} according to {{{set}}}'s order predicate and the conventional meaning of {{{<}}}, {{{=}}}, and {{{>.

Maps

A map data structure stores key-value associations from a set of keys with an order predicate to arbitrary value objects. It is an alternative to an association list or hash table, which also store key-value associations, but with different key constraints and efficiency guarantees.

In the same way that a list of key-value dotted pairs can implement an association list, a set of key-value dotted pairs can implement a map. Implementations may use this approach, or may implement a distinct data structure specific to maps.

Construction

Map-wide queries

(imap? x) (imap-empty? map) (imap-size map) (imap-order-predicate map)

The behavior and efficiency of these operations is the same as the analogous set procedures.

Set-theoretic operations

(imap-union map...) (imap-intersection map...) (imap-difference map...) (imap-xor map1 map2)

The behavior and efficiency of these operations is the same as the analogous set procedures.

Element queries

(imap-member? map key) (imap-min-key map) (imap-max-key map)

The behavior and efficiency of these operations is the same as the analogous set procedures.

(imap-min-value map) (imap-max-value map)

Returns the value associated with (imap-min-key map)}}} and {{{(imap-max-key map) respectively. O(log n) time; O(1) is optimal.

Element operations

(imap-update set key present-proc absent-proc)

A continuation-based universal update procedure. Attempts to find a key key-match}}} equal to {{{key}}} in {{{set}}}. When such a {{{key-match is found, calls

(present-proc key-match value update remove)

which must either tail-call

(update new-key new-value ret)

to replace key}}} with {{{new-key}}} and the corresponding value with {{{new-value, or tail-call

(remove ret)

to remove the assocition from map}}}. In either case {{{ret}}} will be one of the values returned from the enclosing {{{imap-update call.

When no such key-match}}} is found, {{{imap-update calls

(absent-proc insert ignore)

which should either tail-call

(insert new-key value ret)

to insert new-key}}} and {{{value}}} into {{{map, or else tail-call

(ignore ret)

to leave map}}} unchanged. Again, {{{ret}}} will be returned from the enclosing {{{imap-update call.

In any case, imap-update returns

(values new-map ret)

where new-map}}} is a new map reflecting the indicated modification, if any; and {{{ret is the return value produced by one of the continuations. O(log n) time.

(imap-find map key [absent-thunk])

Returns the value associated with key}}} in {{{map}}}, or the result of evaluating {{{(absent-thunk)}}} if no such value exists. If {{{absent-thunk}}} is unspecified then a default procedure that always returns {{{#false is used. O(log n) time.

(imap-include map key value) (imap-exclude map key)

Return a new set that certainly does/doesn't include an association from key}}}. {{{imap-include}}} will replace any prior association from {{{key}}} that might exist in {{{map. O(log n) time.

(imap-key-predecessor map key [absent-thunk]) (imap-key-successor map key [absent-thunk])

The behavior and efficiency of these operations is the same as the analogous set procedures.

Range queries

(imap-between map least include-least most include-most)

The behavior and efficiency of these operations is the same as the analogous set procedures. (Least and most are keys.)

(imap-range< map key) (imap-range<= map key) (imap-range= map key) (imap-range>= map key) (imap-range> map key)

The behavior and efficiency of these operations is the same as the analogous set procedures.

Filter, fold, map

(imap-filter proc set)

Returns a new set containing only those associations for which (proc key value)}}} returns non-{{{#false. O(n log n) time; O(n) is optimal.

(imap-fold proc base map)

The fundamental map iterator. Calls

(proc key value accum)

successively to accumulate a value initialized to base. O(n) time.

(imap-map-values proc map)

Returns a new map where each value is replaced with the result of evaluating

(proc key value)

for that value's key. Takes O(n) time.

(imap-map/monotone proc map [precedes?])

Returns a new map containing the association values returned by calls to

(proc key value)

for each key-value association in map}}}. {{{proc}}} must return {{{(values new-key new-value)}}}. The key mapping must be a ''monotone,'' preserving the order and uniqueness of keys. The value mapping need not be monotone. If {{{precedes?}}} is given, it is the key order predicate of the new map; otherwise the new map uses the same order predicate as {{{map. Takes O(n) time.

(imap-map proc map [precedes?])

As imap-map/monotone}}}, except the {{{proc}}} is not required to be monotone. When multiple keys are mapped to the same {{{new-key, only the association originating from the greatest pre-mapped key is retained. O(n log n) time.

Conversion

(imap->ordered-alist map)

Returns an association list containing the associations of map in increasing order by key. O(n) time.

(imap-keys map)

Returns a set containing the keys of map. O(n log n) time; O(n) is optimal.

(imap-values map)

Returns a list containing the values of map in increasing order by key. O(n) time.

(ordered-alist->imap precedes? alist)

Returns a map containing the associations of alist}}} and using order predicate {{{precedes?}}}. It is an error for {{{list to be anything other than a proper association list in increasing order by key. O(n log n) time; O(n) is optimal.

(alist->imap precedes? alist)

The behavior and efficiency of these operations is the same as the analogous set procedures.

Key comparison

(imap-key<? map x y) (imap-key=? map x y) (imap-key>? map x y)

The behavior and efficiency of these operations is the same as the analogous set procedures.

Open questions

John Cowan's comments

To start with, I basically think this proposal is an excellent start, providing facilities that Scheme programmers should have easily available to them.

I think it is very important that these proposals all be fleshed out to something the size of SRFI 1 and SRFI 113. Users should not wind up choosing between one data structure and another by which one has a more convenient API. If the API is as consistent as possible in all proposals, it's a big win for usability. I do not mean that absolutely every SRFI 1 feature is needed (in particular, I don't see a need for association deque support), but that these APIs seem much too small to fit with the evolving style of the large language; in the words of the charter, to be "large enough to address the practical needs of mainstream software development". Keeping it uniform also helps with human-memory issues. See ContainerSrfiComparison for something close to what I have in mind.

Comments on ideques

I will be putting together a proposal for mutable deques (aka tconc lists) at some point, no doubt closely based on this proposal and SRFI 1.

Comments on isets and imaps

It's premature to do so yet, but I think this proposal should be prepared to incorporate ComparatorsCowan when it becomes a draft SRFI. This would mean accepting comparators wherever a precedence (<) function is expected. I plan to convert SRFI 113, HashTablesCowan, and whatever we use for a sort package (which I hope will be a revitalized SRFI 32) to this style.

I'm not opposed to the LRU convention, but I'd like to see it justified. If I'm convinced, I'll adopt it for SRFI 113 too -- again, uniformity is a win. Currently the spec says nothing and the implementation uses first-addition-wins. Terminologically, I think it should be "least recently added" rather than "used", or better yet speak of a "most recently added retention policy" rather than a removal policy.

Take a look at the explicit merger procedures of SRFI 113 for union and intersection.

Comments on open issues

I think immutable pairs/lists should be a separate proposal.

I don't think the imap/IMAP naming conflict is important at all.

Update procedure draft 2

(imap-update map key absent-proc present-proc)

A continuation-based universal update procedure. Attempts to find a key key-match}}} equal to {{{key}}} in {{{map}}}, and its associated {{{value}}}. When such no such {{{key-match is found, returns the result of calling

(absent-proc insert)

. insert is a procedure such that

(insert new-value)

returns a version of map}}} in which {{{key}}} is associated with {{{new-value.

When key-match exists, returns the result of calling

(present-proc key-match value replace remove)

. replace is a procedure such that

(replace new-value)

returns a new version of map}}} in which the {{{key}}} maps to {{{new-value}}} instead of {{{value.

remove is a thunk such that

(remove)

returns a new version of map}}} with no association for {{{key.

The imap-update}}}, {{{insert}}}, {{{replace}}}, and {{{remove}}} procedures may take O(log n) time each. Thus if {{{absent-proc}}} and {{{present-proc take constant (or even O(log n)) time and call their passed-in continuations at most once (or even any constant number of times), the entire update process takes O(log n) time.