This site is a static rendering of the Trac instance that was used by R7RS-WG1 for its work on R7RS-small (PDF), which was ratified in 2013. For more information, see Home. For a version of this page that may be more recent, see CharacterSpansCowan in WG2's repo for R7RS-large.

Character­Spans­Cowan

cowan
2015-03-21 04:12:53
9history
source

Character span library

This is a library for manipulating textual content based on character spans, also known as just spans. These are references to a string and two delimiting string cursors, which are references to positions within a string. Character spans are immutable, and except as noted below, it is an error to mutate the string that underlies a span.

Issues

  1. Is character span a satisfactory name?

2-6. (resolved)

  1. Currently span-mismatch returns a cursor into its second argument. Should it return a cursor into the first argument, or into both arguments? (In Chibi it makes no difference.)

8-10. (resolved)

Rationale

When SRFI 13 was defined in 1999, it was intended to provide efficient string operations on both whole strings and substrings. At that time, only Guile and T provided true shared substrings, and SRFI 13 could not reasonably require them of a Scheme implementation. Consequently, almost all the SRFI 13 procedures accept optional start and end arguments for each of the string arguments, indexing the beginning and the end of the substring(s) to be operated on.

Unfortunately, variable-arity procedures are often slow and may not interact well with type checking in Schemes that provide it. In addition, it is now fairly common to store strings internally as UTF-8 or UTF-16 code unit sequences, which means that indexing operations are often O(n) rather than O(1), and string mutation can be extremely expensive if the storage used for the string needs to be expanded and the implementation does not use an indirect pointer to it (as in Chicken).

As for shared substrings, they are no more common in 2015 than they were in 1999. (Guile actually provides not only shared substrings but also copy-on-write and fully immutable strings.) Fortunately, since then it has become normal for Schemes to provide user-defined records, and they are required by both R6RS and R7RS. This makes it possible to portably provide a representation for a segment of a string, provided the string is never mutated. The most portable such record consists of a string and two indexes, but other more efficient representations may be used instead.

This proposal, therefore, is intended to help move the practice of Scheme programming away from mutable strings, string indexes, and SRFI 13, while maintaining some degree of conceptual backward compatibility. It does not require any particular run-time efficiencies from its procedures.

The operations provided here (with the exception of those in the Case and Comparisons sections) are entirely independent of the character repertoire supported by the implementation. In particular, this means that the case-insensitive procedures of SRFI 13 are excluded. There is also no provision for R6RS normalization procedures or for an string->integer procedure that was proposed for SRFI 13 but not included. These may appear in future SRFIs.

Specification

Cursors are pointers into strings or spans, and are not necessarily disjoint from other Scheme types. For example, they may be exact integers that are character-based indexes into strings. Alternatively, in an implementation whose internal representation of strings is UTF-8, string cursors may be indexes of individual bytes in the string. It is also possible to implement string cursors as objects of a disjoint type. The string cursor procedures of this proposal are mostly taken from Chibi Scheme.

Given a span of length n, there are n+2 possible cursors that refer to it: one for each character in the span, one for the position just before the first character (known as the "pre-start cursor"), and one for the position just after the last character (known as the "post-end cursor"). These additional positions are provided for backward and forward iteration respectively, and also because when creating a span from cursors the second cursor argument is exclusive. It is an error if a cursor argument is not one of the possible cursors for the string or span argument.

Note: Procedures with similar names and purposes to R7RS-small or SRFI 13 procedures are marked [R7RS-small] and [SRFI 13] respectively.

All predicates passed to procedures defined in this proposal may be called in any order and any number of times, except as otherwise noted. In SRFI 13, there is no such provision, and so character sets are inherently more efficient than predicates because testing them is fast and free of side effects, though how fast character sets are if they support full Unicode is very implementation-dependent.

For security and efficiency reasons, there is no operation to retrieve the underlying string from a character span.

String and span cursors

These operations exist in pairs, as they apply to both strings and spans.

(string-cursor-start string)

(span-cursor-start span)

Returns a cursor referring to the first character in string or span.

(string-cursor-end string)

(span-cursor-end span)

Returns the post-end cursor of string or span.

(string-cursor-ref string cursor)

(span-cursor-ref span cursor)

Returns the character referred to by cursor. It is an error if cursor is the pre-start or post-end cursor of string or span.

(string-cursor-next string cursor)

(span-cursor-next span cursor)

Returns the cursor following cursor. It is an error if cursor is the post-end cursor of string or span.

(string-cursor-prev string cursor)

(span-cursor-prev span cursor)

Returns the cursor preceding cursor. It is an error if cursor is the pre-start cursor of string or span.

(string-cursor-forward string cursor n)

(span-cursor-forward span cursor n)

(string-cursor-backward string cursor n)

(span-cursor-backward span cursor n)

Iterates string-cursor-next or string-cursor-prev n times.

(string-cursor=? string cursor1 cursor2)

(span-cursor=? span cursor1 cursor2)

(string-cursor<? string cursor1 cursor2)

(span-cursor<? span cursor1 cursor2)

(string-cursor>? string cursor1 cursor2)

(span-cursor>? span cursor1 cursor2)

(string-cursor<=? string cursor1 cursor2)

(span-cursor<=? span cursor1 cursor2)

(string-cursor>=? string cursor1 cursor2)

(span-cursor>=? span cursor1 cursor2)

Compare cursor1 and cursor2.

(string-cursor->index string cursor)

(span-cursor->index span cursor)

Return the character index into string or span corresponding to cursor. If cursor is the pre-start cursor of string or span, -1 is returned. If cursor is the post-end cursor of string or span, the length of string or span is returned.

(string-index->cursor string index)

(span-index->cursor span index)

Returns the cursor referring to string or span that corresponds to index. If index is -1, returns the pre-start cursor of string or span.If index is equal to the length of string or span, returns the post-end cursor of string or span.

(string-cursor-difference string cursor1 cursor2)

(span-cursor-difference span cursor1 cursor2)

Return the difference in characters between cursor2 and cursor1. The arguments are in reverse order from plain - so that they can be passed as start and end cursors respectively.

Span constructors

(make-whole-span string)

Returns a character span which contains all the characters of string in order.

(make-span string start end)

Returns a character span which contains the characters of string in order from indexes start (inclusive) to end (exclusive).

(span char ...) [R7RS-small]

Returns a character span which contains the characters char in order.

(span-transform proc span obj ...)

Proc is a procedure which accepts a string as its first argument and returns a string. It is invoked on a string which contains the characters of span in order plus the obj arguments, if any. The resulting string is returned as a character span by span-transform. This procedure allows string-based procedures to be easily used in a context that provides and expects spans.

(span-unfold stop? mapper successor seed)

(span-unfold-right stop? mapper successor seed)

Returns a span whose characters are generated in forward/reverse order using the following algorithm, initializing x to seed: If the result of applying the predicate stop? to x is true, the span is complete and is returned. Otherwise, apply the procedure mapper to x. The value that mapper returns becomes the next character of the span. Then a new value of x is obtained by applying the procedure successor to x, and this algorithm is repeated.

(span-tabulate len proc) [SRFI 13]

Invokes proc for all exact integers between 0 (inclusive) and len (exclusive), and returns a span containing the characters returned by the invocations in order.

Compatibility note: The argument order here agrees with the list-tabulate procedure of SRFI 1 rather than SRFI 13's string-tabulate procedure. The discrepancy was unintentional, but was discovered too late to fix.

Predicates

(span? obj) [R7RS-small]

Returns #t if obj is a character span, and #f otherwise.

(span-null? span) [SRFI 13]

Returns #t if span contains zero characters, and #f otherwise.

(span-every? pred span) [SRFI 13]

Returns #t if pred returns true for every character in span, and #f otherwise.

(span-any? pred span) [SRFI 13]

Returns #f if pred returns false for each character in span, and #t otherwise.

Selection

(span-ref span k) [R7RS-small]

Returns the kth character of span, starting with 0. It is an error if k is not a non-negative exact integer less than the length of span.

(span-take span n) [SRFI 13]

Returns a character span which contains the first n characters of span.

(span-take-right span n) [SRFI 13]

Returns a character span which contains the last n characters of span.

(span-drop span n) [SRFI 13]

Returns a character span which contains all but the first n characters of span.

(span-drop-right span n) [SRFI 13]

Returns a character span which contains all but the last n characters of span.

(span-split-at span n) [SRFI 13]

Returns two values, a character span containing the first n characters of span, and another character span containing the remaining characters of span.

(span-replicate span from to)

Span is conceptually replicated an infinite number of times to both left and right, and this doubly infinite sequence is then truncated to form a span starting at index from (inclusive) and ending at index to (exclusive). Negative indexes are allowed in order to access the infinite left extension.

Compatibility note: This procedure is analogous the SRFI 13 procedure xsubstring, except that the to argument is required.

Examples:

(span-replicate (span-whole-string "abcdef") 2 7) => span containing "cdefab" ; rotate left (span-replicate (span-whole-string "abcdef") -2 4) => span containing "efabcd" ; rotate right (span-replicate (span-whole-string "abc") 0 7) => span containing "abcabca" ; replicate

(subspan span start end) [R7RS-small]

(subspan/cursors span start end)

Returns a character span which contains the characters in span between the indexes/cursors start (inclusive) and end (exclusive).

Padding, trimming, and compressing

(span-pad span len [ char ]) [SRFI 13]

(span-pad-right span len [ char ]) [SRFI 13]

Returns a span of length len consisting of span padded on the left/right by as many occurrences of the character char as needed. If span has more than len characters, it is truncated on the left (right) to length len. If char is omitted, #\space is used.

(span-trim span [ pred ]) [SRFI 13]

(span-trim-right span [ pred ]) [SRFI 13]

(span-trim-both span [ pred ]) [SRFI 13]

Trim span by skipping over all characters on the left / on the right / on both sides that satisfy pred and returning the resulting span. Pred defaults to (lambda (x) (eqv? x #\space)).

(span-compress span [ char ])

Return a span which differs from span in that every sequence of consecutive occurrences of char has been replaced by a single char. If char is omitted, #\space is used.

Prefixes and suffixes

(span-prefix span1 span2)

(span-suffix span1 span2)

Returns a span containing the characters in the common prefix/suffix of span1 and span2. If there is no common prefix/suffix, returns an empty span.

(span-prefix-length span1 span2) [SRFI 13]

(span-suffix-length span1 span2) [SRFI 13]

Returns the length (in characters) of the span that would be returned by span-prefix and friends.

(span-mismatch span1 span2)

Returns a cursor referring to the first character in span2 that is not equal to the corresponding character in span1. If the spans are equal, the post-end cursor of span2 is returned. If span2 is empty, then the pre-start cursor of span2 is returned.

(span-mismatch-right span1 span2)

Returns a cursor referring to the first character in span2, scanning in reverse order, that is not equal to the corresponding character in span1, also processed in reverse order. If the spans are equal, the pre-start cursor of span2 is returned. If span2 is empty, then the post-end cursor of span2 is returned.

(span-prefix? span1 span2) [SRFI 13]

(span-suffix? span1 span2) [SRFI 13]

Returns #t if span1 is a prefix/suffix of span2, and #f otherwise.

Searching

(span-count pred span) [SRFI 13]

Returns the number of characters in span which satisfy pred as an exact integer.

(span-find pred span)

Returns a cursor referring to the first character in span that satisfies pred, or the cursor referring to the position after the last character if there is none.

(span-find-right pred span)

Returns a cursor referring to the first character in span that satisfies pred, processing span in reverse order, or the pre-start cursor of span if there is none.

Compatibility note: These procedures are analogous to SRFI 13's string-index procedures, but return cursors rather than indexes or #f on failure.

(span-skip pred span) [SRFI 13]

(span-skip-right pred span) [SRFI 13]

Returns a cursor pointing to the first/last character in span that does not satisfy pred, or the post-end/pre-start cursor if there is none.

(span-take-while pred span)

Returns the longest initial prefix of span whose characters all satisfy pred.

(span-drop-while pred span)

Drops the longest initial prefix of span whose characters all satisfy pred, and returns the rest of span.

(span-span pred span)

(span-break pred span)

The span-span procedure splits span, returning two values: the longest initial prefix whose characters all satisfy pred, and the remainder of span. The span-break procedure inverts the sense of the predicate: the remainder commences with the first character of span that satisfies pred. In other words: span-span finds the intial span of characters satisfying pred, and span-break breaks span at the first character satisfying pred.

The name "span-span" is unfortunate but unavoidable.

(span-contains haystack needle) [SRFI 13]

It is an error if needle and haystack are not both spans. Returns a cursor referring to haystack indicating the first position of the first subspan in which the characters of needle appear. If there is no such position (in particular, if either needle or haystack is empty), #f is returned.

The whole character span or string

(span-length span) [SRFI 13]

Returns the number of characters in span.

(span-reverse span) [SRFI 13]

Returns a span containing the characters of span in reverse order.

(span-append span ...) [SRFI 13]

Returns a span containing the characters of the spans in order.

(span-concatenate list) [SRFI 13]

Returns a span containing the characters of the spans and/or strings enumerated in list in order. This procedure will succeed even if (apply span-append list-of-spans) fails due to an implementation limit on the number of arguments a procedure may receive.

(span-concatenate-reverse list) [SRFI 13]

The same as span-concatenate, except that list is processed in reverse order. Note that the individual spans are not processed in reverse order.

This procedure is useful in the construction of procedures that accumulate character data into lists of string buffers, and wish to convert the accumulated data into a single string when done.

Folding and mapping

(span-map proc span ...) [SRFI 13]

It is an error if proc does not accept as many arguments as there are spans and return a single character.

Applies proc element-wise to the characters of the spans and returns a span of the results, in order. If more than one span is given and not all spans have the same length, span-map terminates when the shortest span runs out. The dynamic order in which proc is applied to the characters of the spans is unspecified. If multiple returns occur from span-map, the values returned by earlier returns are not mutated.

(span-for-each proc span ...) [SRFI 13]

It is an error if proc does not accept as many arguments as there are strings.

The arguments to span-for-each are like the arguments to span-map, but it calls proc for its side effects rather than for its values. Unlike span-map, this procedure is guaranteed to call proc exactly once on each of the characters of the spans in order from the first character(s) to the last. The value returned is unspecified. If more than one span is given and not all spans have the same length, span-for-each terminates when the shortest string runs out.

(span-fold proc nil span ...) [SRFI 13]

(span-fold-right proc nil span ...) [SRFI 13]

Invokes proc on each character of the spans in forward/reverse order, passing the result of the previous invocation as an additional argument. For the first invocation, nil is used as the additional argument. Returns the result of the last invocation, or nil if there was no invocation. It is an error if proc does not accept the same number of arguments as there are spans plus one.

Parsing and unparsing

(span-split span [separator [ limit ] ])

Returns a list of the words contained in span. If separator (which is also a character span) is omitted, then the words are separated by one or more whitespace characters (those on which char-whitespace? returns #t). If separator is supplied, it specifies a span whose characters are to be used as the word separator. The returned list will then have one more item than the number of non-overlapping occurrences of the separator in the string. If separator is an empty span, then the returned list contains a list of spans, each of which contains a single character.

If limit is provided, at most that many splits occur, and the remainder of span is returned as the final element of the list (thus, the result will have at most limit+1 elements). If limit is not specified, then as many splits as possible are made. It is an error if limit is not a positive exact integer.

(span-join list [ delim [ grammar ] ]) [SRFI 13]

This procedure pastes the elements of list together using delimiter, which is a span. For convenience, list elements are allowed to be strings or spans or both. If delimiter is omitted, it is a single space. The grammar argument is a symbol that determines how the delimiter is used, and defaults to infix. The following values are understood:

Filtering and partitioning

(span-filter pred span) [SRFI 13]

Returns a span containing the characters of span which satisfy pred.

(span-remove pred span)

Returns a span containing the characters of span which do not satisfy pred.

Compatibility note: The SRFI 13 version of this procedure is called string-delete, which is inconsistent with the conventions of SRFI 1 and other SRFIs.

(span-partition pred span)

Returns two values, a span containing the characters of span which satisfy pred and another span containing those which do not.

Conversion

(string->span string)

Returns a span that contains the characters of string in order. Later mutation of string will not affect the value of span.

(span->string span)

Returns a newly allocated string which contains the characters of span in order.

(span->list span) [R7RS-small]

(span->vector span) [R7RS-small]

Returns a newly allocated list/vector containing the characters of span in order.

(list->span list) [R7RS-small]

(vector->span vector) [R7RS-small]

Returns a span containing the elements of list or vector in order. It is an error if any of the elements are not characters.

(reverse-list->span list) [SRFI 13]

Returns a span whose characters are the elements of list in reverse order. It is an error if the elements are not characters.

Case

(span-upcase span) [R7RS-small]

(span-downcase span) [R7RS-small]

(span-foldcase span) [R7RS-small]

In any implementation of this proposal based on R7RS, these procedures must behave analogously to the corresponding string procedures. That is, if a call to string procedure x (such as string-upcase or string=? on a string containing characters y0 ... yn produces a string containing characters z0 ... zn, then a call to the analogous span procedure x′ on a span containing characters y0 ... yn must produce a span containing characters z0 ... zn.

Comparison

(span=? span1 span2 span ...) [R7RS-small]

(span<? span1 span2 span ...) [R7RS-small]

(span>? span1 span2 span ...) [R7RS-small]

(span<=? span1 span2 span ...) [R7RS-small]

(span>=? span1 span2 span ...) [R7RS-small]

(span-ci=? span1 span2 span ...) [R7RS-small]

(span-ci<? span1 span2 span ...) [R7RS-small]

(span-ci>? span1 span2 span ...) [R7RS-small]

(span-ci<=? span1 span2 span ...) [R7RS-small]

(span-ci>=? span1 span2 span ...) [R7RS-small]

In any implementation of this proposal based on R7RS, the results of the span procedures must behave as if the arguments were converted to strings and then passed to the corresponding string procedures.

Comparator

span-comparator

This is a variable containing a SRFI 114 comparator for comparing strings. Its procedures behave as if their arguments are converted to strings and then passed to the procedures of string-comparator.

Sample Implementation

The sample implementation (which is not yet written) represents spans as records containing a string and two string cursors, and provides two implementations of string cursors, one using string indexes directly and one that layers UTF-8 character spans on top of single-byte native strings.

In addition, it provides a library with the same API as this proposal, but operating on strings rather than spans and using string- in their names instead of span-. This library can assist in converting SRFI-13-based programs that do not exploit the substring facilities. There is also a shim for R5RS systems that do not yet provide the full R7RS-small string API.