This site is a static rendering of the Trac instance that was used by R7RS-WG1 for its work on R7RS-small (PDF), which was ratified in 2013. For more information, see Home.

Source for wiki CharacterSpansCowan version 7

author

cowan

comment

ipnr

127.11.51.1

name

CharacterSpansCowan

readonly

text

== Character span library ==

This is a library for manipulating textual content based on ''character spans'', also known as just ''spans''.  These are references a string and two delimiting ''string cursors'', which are references to the positions within a string.  Character spans are immutable, and except as noted below, it is an error to mutate the string that underlies a span.  

== Issues ==

1. Is ''character span'' a satisfactory name?

2-6. (resolved)

7. Currently `span-mismatch` returns a cursor into its second argument.  Should it return a cursor into the first argument, or into both arguments?  (In Chibi it makes no difference.)

8-10.  (resolved)

== Rationale ==

When SRFI 13 was defined in 1999, it was intended to provide efficient string operations on both whole strings and substrings.  At that time, only Guile and T provided true shared substrings, and SRFI 13 could not reasonably require them of a Scheme implementation.  Consequently, almost all the SRFI 13 procedures accept optional ''start'' and ''end'' arguments for each of the string arguments, indexing the beginning and the end of the substring(s) to be operated on.

Unfortunately, variable-arity procedures are often slow and may not interact well with type checking in Schemes that provide it.  In addition, it is now fairly common to store strings internally as UTF-8 or UTF-16 code unit sequences, which means that indexing operations are often O(n) rather than O(1), and string mutation can be extremely expensive if the storage used for the string needs to be expanded and the implementation does not use an indirect pointer to it (as in Chicken).

As for shared substrings, they are no more common in 2015 than they were in 1999.  (Guile actually provides not only shared substrings but also copy-on-write and fully immutable strings.)  Fortunately, since then it has become normal for Schemes to provide user-defined records, and they are required by both R6RS and R7RS.  This makes it possible to portably provide a representation for a segment of a string, provided the string is never mutated.  The most portable such record consists of a string and two indexes, but other more efficient representations may be used instead.

This proposal, therefore, is intended to help move the practice of Scheme programming away from mutable strings, string indexes, and SRFI 13, while maintaining some degree of conceptual backward compatibility.  It does not require any particular run-time efficiencies from its procedures.

The operations provided here (with the exception of those in the Case and Comparisons sections) are entirely independent of the character repertoire supported by the implementation.  In particular, this means that the case-insensitive procedures of SRFI 13 are excluded.  There is also no provision for [http://www.r6rs.org/final/html/r6rs-lib/r6rs-lib-Z-H-2.html#node_idx_54 R6RS normalization procedures] or for an `string->integer` procedure that was proposed for SRFI 13 but not included.  These may appear in future SRFIs.

== Specification ==

String cursors are pointers into strings or spans, and are not necessarily disjoint from other Scheme types.  For example, they may be exact integers that are character-based indexes into strings.  Alternatively, in an implementation whose internal representation of strings is UTF-8, string cursors may be indexes of individual bytes in the string.  It is also possible to implement string cursors as objects of a disjoint type.  The string cursor procedures of this proposal are mostly taken from Chibi Scheme.

Given a span of length ''n'', there are ''n''+2 possible cursors that refer to it: one for each character in the span, one for the position just before the first character, and one for the position just after the last character.  These additional positions are provided for backward and forward iteration respectively, and also because when creating a span from cursors the second cursor argument is exclusive.

Note: Procedures with similar names and purposes to R7RS-small or [http://srfi.schemers.org/srfi-13/srfi-13.html SRFI 13] procedures are marked [R7RS-small] and [SRFI 13] respectively.

All predicates passed to procedures defined in this proposal may be called in any order and any number of times, except as otherwise noted.  In SRFI 13, there is no such provision, and so character sets are inherently more efficient than predicates [http://srfi.schemers.org/srfi-13/mail-archive/msg00052.html because testing them is fast and free of side effects], though how fast character sets are if they support full Unicode is very implementation-dependent.

For security and efficiency reasons, there is no operation to retrieve the underlying string from a character span.

== String cursors ==

These operations exist in pairs, as they apply to both strings and spans.

`(string-cursor-start `''string''`)`

`(span-cursor-start `''span''`)`

Returns a cursor referring to the first character in ''string'' or ''span''.

`(string-cursor-end `''string''`)`

`(span-cursor-end `''span''`)`

Returns a cursor referring to the position following the last character in ''string'' or ''span''.

`(string-cursor-ref `''string cursor''`)`

`(span-cursor-ref `''span cursor''`)`

Returns the character referred to by ''cursor''.  It is an error if ''cursor'' does not refer to a character in ''string'' or ''span''.

`(string-cursor-next `''string cursor''`)`

`(span-cursor-next `''span cursor''`)`

Returns the cursor following ''cursor''.  It is an error if ''cursor'' does not refer to a character in ''string'' or ''span''.

`(string-cursor-prev `''string cursor''`)`

`(span-cursor-prev `''span cursor''`)`

Returns the cursor preceding ''cursor''.  It is an error if ''cursor'' does not refer either to a character in ''string'' or ''span'' or the position following the last character in ''string'' or ''span''. 

`(string-cursor-forward `''string cursor n''`)`

`(span-cursor-forward `''span cursor n''`)`

`(string-cursor-backward `''string cursor n''`)`

`(span-cursor-backward `''span cursor n''`)`

Iterates `string-cursor-next` or `string-cursor-prev` ''n'' times.

`(string-cursor-forward-until `''string cursor pred''`)`

`(span-cursor-forward-until `''span cursor pred''`)`

Iterates `string-cursor-next` until it refers to a character that satisfies ''pred'' or the position following the last character of ''string'' or ''span'' is reached, and returns that cursor.

`(string-cursor-backward-until `''string cursor pred''`)`

`(span-cursor-backward-until `''span cursor pred''`)`

Iterates `span-cursor-prev` until it refers to a character that satisfies ''pred'' or the position preceding the first character of ''string'' or ''span'' is reached, and returns that cursor.

`(string-cursor=? `''string cursor,,1,, cursor,,2,,''`)`

`(span-cursor=? `''span cursor,,1,, cursor,,2,,''`)`

`(string-cursor<? `''string cursor,,1,, cursor,,2,,''`)`

`(span-cursor<? `''span cursor,,1,, cursor,,2,,''`)`

`(string-cursor>? `''string cursor,,1,, cursor,,2,,''`)`

`(span-cursor>? `''span cursor,,1,, cursor,,2,,''`)`

`(string-cursor<=? `''string cursor,,1,, cursor,,2,,''`)`

`(span-cursor<=? `''span cursor,,1,, cursor,,2,,''`)`

`(string-cursor>=? `''string cursor,,1,, cursor,,2,,''`)`

`(span-cursor>=? `''span cursor,,1,, cursor,,2,,''`)`

Compare ''cursor,,1,,'' and ''cursor,,2,,''.  It is an error if the cursors do not refer to positions in ''string'' or ''span''.

`(string-cursor->index `''string cursor''`)`

`(span-cursor->index `''span cursor''`)`

Return the character index into ''string'' or ''span'' corresponding to ''cursor''.  It is an error if ''cursor'' does not refer to a position in ''string'' or ''span''.

`(string-index->cursor `''string index''`)`

`(span-index->cursor `''span index''`)`

Return the cursor referring to ''string'' or ''span'' that corresponds to ''index''.  It is an error if ''index'' is less than zero or greater than the length of ''string'' or ''span''.

`(string-cursor-difference `''string cursor,,1,, cursor,,2,,''`)`

`(span-cursor-difference `''span cursor,,1,, cursor,,2,,''`)`

Return the difference in characters between ''cursor,,2,,'' and ''cursor,,1,,''.  It is an error if the cursors do not refer to positions in ''string'' or ''span''.

== Span constructors ==

`(make-whole-span `''string''`)`

Returns a character span which contains all the characters of ''string'' in order.

`(make-span `''string start end''`)`

Returns a character span which contains the characters of ''string'' in order from indexes ''start'' (inclusive) to ''end'' (exclusive).

`(span `''char'' ...`)` [R7RS-small]

Returns a character span which contains the characters ''char'' in order.

`(span-transform `''proc span obj'' ...`)`

''Proc'' is a procedure which accepts a string as its first argument and returns a string.  It is invoked on a string which contains the characters of ''span'' in order plus the ''obj'' arguments, if any.  The resulting string is returned as a character span by `span-transform`.  This procedure allows string-based procedures to be easily used in a context that provides and expects spans.

`(span-unfold `''stop? mapper successor'' [ ''seed'' ]`)`

`(span-unfold-right `''stop? mapper successor'' ''seed''`)`

Returns a span whose characters are generated in forward/reverse order using the following algorithm, initializing ''x'' to ''seed'': If the result of applying the predicate ''stop?'' to ''x'' is true, the span is complete and is returned.  Otherwise, apply the procedure ''mapper'' to ''x''. The value that ''mapper'' returns becomes the next character of the span.  Then a new value of ''x'' is obtained by applying the procedure ''successor'' to ''x'', and this algorithm is repeated.

`(span-tabulate `''len proc''`)` [SRFI 13]

Invokes ''proc'' for all exact integers between 0 (inclusive) and ''len'' (exclusive), and returns a span containing the characters returned by the invocations in order.

Compatibility note:  The argument order here agrees with the `list-tabulate` procedure of [http://srfi.schemers.org/srfi-1/srfi-1.html SRFI 1] rather than SRFI 13's `string-tabulate` procedure.  The discrepancy was [http://srfi.schemers.org/srfi-13/mail-archive/msg00143.html unintentional], but was [http://srfi.schemers.org/srfi-13/mail-archive/msg00144.html discovered too late to fix].

== Predicates ==

`(span? `''obj''`)` [R7RS-small]

Returns `#t` if ''obj'' is a character span, and `#f` otherwise.

`(span-null? `''span''`)` [SRFI 13]

Returns `#t` if ''span'' contains zero characters, and `#f` otherwise.

`(span-every `''pred span''`)` [SRFI 13]

Returns `#t` if ''pred'' returns true for every character in ''span'', and `#f` otherwise.

`(span-any `''pred span''`)` [SRFI 13]

Returns `#f` if ''pred'' returns false for each character in ''span'', and `#t` otherwise.

== Selection ==

`(span-ref `''span k''`)` [R7RS-small]

Returns the ''k''th character of ''span'', starting with 0.  It is an error if ''k'' is not a non-negative exact integer less than the length of ''span''.

`(span-take `''span n''`)` [SRFI 13]

Returns a character span which contains the first ''n'' characters of ''span''.

`(span-take-right `''span n''`)` [SRFI 13]

Returns a character span which contains the last ''n'' characters of ''span''.

`(span-drop `''span n''`)` [SRFI 13]

Returns a character span which contains all but the first ''n'' characters of ''span''.

`(span-drop-right `''span n''`)` [SRFI 13]

Returns a character span which contains all but the last ''n'' characters of ''span''.

`(span-split-at `''span n''`)` [SRFI 13]

Returns two values, a character span containing the first ''n'' characters of ''span'', and another character span containing the remaining characters of ''span''.

`(span-replicate `''span from to''`)`

''Span'' is conceptually replicated an infinite number of times to both left and right, and this doubly infinite sequence is then truncated to form a span starting at index ''from'' (inclusive) and ending at index ''to'' (exclusive).  Negative indexes are allowed in order to access the infinite left extension.

Compatibility note: This procedure is analogous the SRFI 13 procedure `xsubstring`, except that the ''to'' argument is required.

Examples:

{{{
(span-replicate (span-whole-string "abcdef") 2 7)
      => span containing "cdefab" ; rotate left

(span-replicate (span-whole-string "abcdef") -2 4)
      => span containing "efabcd" ; rotate right

(span-replicate (span-whole-string "abc") 0 7)
      => span containing "abcabca" ; replicate
}}}

`(subspan `''span start end''`)` [R7RS-small]

`(subspan/cursors `''span start end''`)`

Returns a character span which contains the characters in ''span'' between the indexes/cursors ''start'' (inclusive) and ''end'' (exclusive).

== Padding, trimming, and compressing ==

`(span-pad `''span len'' [ ''char'' ]`)` [SRFI 13]

`(span-pad-right `''span len'' [ ''char'' ]`)` [SRFI 13]

Returns a span of length ''len'' consisting of ''span'' padded on the left/right by as many occurrences of the character ''char'' as needed.  If ''span'' has more than ''len'' characters, it is truncated on the left (right) to length ''len''.  If ''char'' is omitted, `#\space` is used.

`(span-trim `''span'' [ ''pred'' ]`)` [SRFI 13]

`(span-trim-right `''span'' [ ''pred'' ]`)` [SRFI 13]

`(span-trim-both `''span'' [ ''pred'' ]`)` [SRFI 13]

Trim ''span'' by skipping over all characters on the left / on the right / on both sides that satisfy ''pred'' and returning the resulting span.  ''Pred'' defaults to `(lambda (x) (eqv? x #\space))`.

`(span-compress `''span'' [ ''char'' ]`)`

Return a span which differs from ''span'' in that every sequence of consecutive occurrences of ''char'' has been replaced by a single ''char''.  If ''char'' is omitted, `#\space` is used.

== Prefixes and suffixes ==

`(span-prefix `''span,,1,,'' ''span,,2,,''`)`

`(span-suffix `''span,,1,,'' ''span,,2,,''`)`

Returns a span containing the characters in the common prefix/suffix of ''span,,1,,'' and ''span,,2,,''.  If there is no common prefix/suffix, returns an empty span.

`(span-prefix-length `''span,,1,,'' ''span,,2,,''`)` [SRFI 13]

`(span-suffix-length `''span,,1,,'' ''span,,2,,''`)` [SRFI 13]

Returns the length (in characters) of the span that would be returned by `span-prefix` and friends.

`(span-mismatch `''span,,1,,'' ''span,,2,,''`)`

Returns a cursor referring to the first character in ''span,,2,,'' that is not equal to the corresponding character in ''span,,1,,''.  If the spans are equal, the cursor referring to the position after the last character in ''span,,2,,'' is returned.

`(span-mismatch-right `''span,,1,,'' ''span,,2,,''`)`

Returns a cursor referring to the first character in ''span,,2,,'', scanning in reverse order, that is not equal to the corresponding character in ''span,,1,,'', also processed in reverse order.  If the spans are equal, the cursor referring to the position before the first character in ''span,,2,,'' is returned.

`(span-prefix? `''span,,1,,'' ''span,,2,,''`)` [SRFI 13]

`(span-suffix? `''span,,1,, span,,2,,''`)` [SRFI 13]

Returns `#t` if ''span,,1,,'' is a prefix/suffix of ''span,,2,,'', and `#f` otherwise.

== Searching ==

`(span-count `''pred span''`)` [SRFI 13]

Returns the number of characters in ''span'' which satisfy ''pred'' as an exact integer.

`(span-find `''pred span''`)`

Returns a cursor referring to the first character in ''span'' that satisfies ''pred'', or the cursor referring to the position after the last character if there is none.

`(span-find-right `''pred span''`)`

Returns a cursor referring to the first character in ''span'' that satisfies ''pred'', processing ''span'' in reverse order, or the cursor referring to the position before the first character if there is none.

Compatibility note:  These procedures are analogous to SRFI 13's `string-index` procedures, but return cursors rather than indexes or `#f` on failure.

`(span-skip `''pred span''`)` [SRFI 13]

`(span-skip-right `''pred span''`)` [SRFI 13]

Returns a cursor pointing to the first/last character in ''span'' that does not satisfy ''pred'', or the end/start cursor if there is none.

`(span-take-while `''pred span''`)`

Returns the longest initial prefix of ''span'' whose characters all satisfy ''pred''. 

`(span-drop-while `''pred span''`)`

Drops the longest initial prefix of ''span'' whose characters all satisfy ''pred'', and returns the rest of ''span''. 

`(span-span `''pred span''`)`

`(span-break `''pred span''`)`

The `span-span` procedure splits ''span'', returning two values: the longest initial prefix whose characters all satisfy ''pred'', and the remainder of ''span''.  The `span-break` procedure inverts the sense of the predicate: the remainder commences with the first character of ''span'' that satisfies ''pred''.  In other words: `span-span` finds the intial span of characters satisfying ''pred'', and `span-break` breaks ''span'' at the first character satisfying ''pred''.

The name "span-span" is unfortunate but unavoidable.

`(span-contains `''haystack needle''`)` [SRFI 13]

It is an error if ''needle'' and ''haystack'' are not both spans.  Returns a cursor referring to ''haystack'' indicating the first position in which the characters of ''needle'' appear.  If there is no such position, `#f` is returned.

== The whole character span or string ==

`(span-length `''span''`)` [SRFI 13]

Returns the number of characters in ''span''.

`(span-reverse `''span''`)` [SRFI 13]

Returns a span containing the characters of ''span'' in reverse order.

`(span-append `''span'' ...`)` [SRFI 13]

Returns a span containing the characters of the ''spans'' in order.

`(span-concatenate `''list''`)` [SRFI 13]

Returns a span containing the characters of the spans and/or strings enumerated in ''list'' in order.  This procedure will succeed even if `(apply span-append list-of-spans)` fails due to an implementation limit on the number of arguments a procedure may receive.

`(span-concatenate-reverse `''list''`)` [SRFI 13]

The same as `span-concatenate`, except that ''list'' is processed in reverse order.  Note that the individual spans are ''not'' processed in reverse order. 

This procedure is useful in the construction of procedures that accumulate character data into lists of string buffers, and wish to convert the accumulated data into a single string when done.
 
== Folding and mapping ==

`(span-map `''proc span'' ...`)` [SRFI 13]

It is an error if ''proc'' does not accept as many arguments as there are spans and return a single character.

Applies ''proc'' element-wise to the characters of the ''spans'' and returns a span of the results,
in order. If more than one span is given and not all spans have the same length, `span-map` terminates
when the shortest span runs out. The dynamic order in which ''proc'' is applied to the characters of the spans is unspecified. If multiple returns occur from `span-map`, the values returned by earlier returns are not mutated.

`(span-for-each `''proc span'' ...`)` [SRFI 13]

It is an error if ''proc'' does not accept as many arguments as there are strings.

The arguments to `span-for-each` are like the arguments to `span-map`, but it calls ''proc'' for its
side effects rather than for its values. Unlike `span-map`, this procedure is guaranteed to call ''proc'' exactly once on each of the characters of the spans in order from the first character(s) to the last.  The value returned is unspecified. If more than one span is given and not all spans have the same length, `span-for-each` terminates when the shortest string runs out.

`(span-fold `''proc nil span'' ...`)` [SRFI 13]

`(span-fold-right `''proc nil span'' ...`)` [SRFI 13]

Invokes ''proc'' on each character of the ''spans'' in forward/reverse order, passing the result of the previous invocation as a second argument. For the first invocation, ''nil'' is used as the second argument. Returns the result of the last invocation, or ''nil'' if there was no invocation.  It is an error if ''proc'' does not accept the same number of arguments as there are spans.

== Parsing and unparsing ==

`(span-split `''span'' [''separator'' [ ''limit'' ] ]`)`

Returns a list of the words contained in ''span''.  If ''separator'' (which is also a character span) is omitted, then the words are separated by one or more whitespace characters (those on which `char-whitespace?` returns `#t`). If ''separator'' is supplied, it specifies a string to be used as the word separator. The returned list will then have one more item than the number of non-overlapping occurrences of the separator in the string.  If ''separator'' is an empty span, then the returned list contains a list of the characters in ''span''.

If ''limit'' is provided, at most that many splits occur, and the remainder of ''span'' is returned as the final element of the list (thus, the result will have at most ''limit''+1 elements). If ''limit'' is not specified, then as many splits as possible are made.  It is an error if ''limit'' is not a positive exact integer.

`(span-join `''list'' [ ''delim'' [ ''grammar'' ] ]`)` [SRFI 13]

This procedure pastes the elements of ''list'' together using ''delimiter'', which is a span.  For convenience, ''list'' elements are allowed to be strings or spans or both.  If ''delimiter'' is omitted, it is a single space.  The ''grammar'' argument is a symbol that determines how the delimiter is used, and defaults to `infix`.  The following values are understood:

 * `infix` means an infix or separator grammar: insert the delimiter between list elements. An empty list will produce an empty string; a singleton list will produce the list's element.
 * `strict-infix` means the same as `infix`, but will signal an error if given an empty list. 
 * `suffix` means a suffix or terminator grammar: insert the delimiter after every list element.
 * `prefix` means a prefix grammar: insert the delimiter before every list element.

== Filtering and partitioning ==

`(span-filter `''pred span''`)` [SRFI 13]

Returns a span containing the characters of ''span'' which satisfy ''pred''.

`(span-remove `''pred span''`)`

Returns a span containing the characters of ''span'' which do not satisfy ''pred''.

Compatibility note:  The SRFI 13 version of this procedure is called `string-delete`, which is inconsistent with the conventions of SRFI 1 and other SRFIs.

`(span-partition `''pred span''`)`

Returns two values, a span containing the characters of ''span'' which satisfy ''pred'' and another span containing those which do not.

== Conversion ==

`(string->span `''string''`)`

Returns a span that contains the characters of ''string'' in order.  Later mutation of ''string'' will not affect the value of ''span''.

`(span->string `''span''`)`

Returns a newly allocated string which contains the characters of ''span'' in order.

`(span->list `''span''`)` [R7RS-small]

`(span->vector `''span''`)` [R7RS-small]

Returns a newly allocated list/vector containing the characters of ''span'' in order.

`(list->span `''list''`)` [R7RS-small]

`(vector->span `''vector''`)` [R7RS-small]

Returns a span containing the elements of ''list'' or ''vector'' in order.  It is an error if any of the elements are not characters.

`(reverse-list->span `''list''`)` [SRFI 13]

Returns a span whose characters are the elements of ''list'' in reverse order.  It is an error if the elements are not characters.

== Case ==

`(span-upcase `''span''`)` [R7RS-small]

`(span-downcase `''span''`)` [R7RS-small]

`(span-foldcase `''span''`)` [R7RS-small]

In any implementation of this proposal based on R7RS, these procedures must behave analogously to the corresponding string procedures.  That is, if a call to string procedure ''x'' (such as `string-upcase` or `string=?` on a string containing characters ''y,,0,, ... y,,n,,'' produces a string containing characters ''z,,0,, ... z,,n,,'', then a call to the analogous span procedure ''x′'' on a span containing characters ''y,,0,, ... y,,n,,'' must produce a span containing characters ''z,,0,, ... z,,n,,''.

== Comparison ==

`(span=? `''span,,1,, span,,2,, span'' ...`)` [R7RS-small]

`(span<? `''span,,1,, span,,2,, span'' ...`)` [R7RS-small]

`(span>? `''span,,1,, span,,2,, span'' ...`)` [R7RS-small]

`(span<=? `''span,,1,, span,,2,, span'' ...`)` [R7RS-small]

`(span>=? `''span,,1,, span,,2,, span'' ...`)` [R7RS-small]

`(span-ci=? `''span,,1,, span,,2,, span'' ...`)` [R7RS-small]

`(span-ci<? `''span,,1,, span,,2,, span'' ...`)` [R7RS-small]

`(span-ci>? `''span,,1,, span,,2,, span'' ...`)` [R7RS-small]

`(span-ci<=? `''span,,1,, span,,2,, span'' ...`)` [R7RS-small]

`(span-ci>=? `''span,,1,, span,,2,, span'' ...`)` [R7RS-small]

In any implementation of this proposal based on R7RS, the results of the span procedures must behave as if the arguments were converted to strings and then passed to the corresponding string procedures.

== Comparator ==

`span-comparator`

This is a [http://srfi.schemers.org/srfi-114/srfi-114.html SRFI 114] comparator for comparing strings.  Its procedures behave as if their arguments are converted to strings and then passed to the procedures of `string-comparator`.

== Sample Implementation ==

The sample implementation (which is not yet written) represents spans as records containing a string and two string cursors, and provides two implementations of string cursors, one using string indexes directly and one that layers UTF-8 character spans on top of single-byte native strings.

In addition, it provides a library with the same API as this proposal, but operating on strings rather than spans and using `string-` in their names instead of `span-`.  This library can assist in converting SRFI-13-based programs that do not exploit the substring facilities.  There is also a shim for R5RS systems that do not yet provide the full R7RS-small string API.

time

2015-02-08 05:48:28