Re: Tokenizer (Re: Most efficient way to read words from string.)

From	tpeplt <tpeplt@gmail.com>
Newsgroups	comp.lang.lisp, comp.lang.scheme
Subject	Re: Tokenizer (Re: Most efficient way to read words from string.)
Date	2025-08-26 15:02 -0400
Organization	A noiseless patient Spider
Message-ID	<87ldn5zz2k.fsf@gmail.com>
References	<108jta2$3uj44$1@dont-email.me>

>> (defun string-split (str &optional (separator #\Space))
>>   "Splits the string STR at each SEPARATOR character occurrence.
>> The resulting substrings are collected into a list which is returned.
>> A SEPARATOR at the beginning or at the end of the string STR results
>> in an empty string in the first or last position of the list
>> returned."
>>   (declare (type string str)
>>            (type character separator))
>>   (loop for start  = 0 then (1+ end)
>>         for end    = (position separator str :start 0)
>>                      then (position separator str :start start)
>>         for substr = (subseq str start end)
>>                      then (subseq str start end)
>>         collect substr into result
>>         when (null end) do (return result)
>>         ))
>

1. In a for-as-equals-then clause, if the ‘then’ form (FORM2)
is omitted, FORM1 is re-evaluated on every iteration.
So, if FORM2 is identical to FORM1, it may be omitted.

See:
http://www.ai.mit.edu/projects/iiip/doc/CommonLISP/HyperSpec/Body/sec_6-1-2-1-4.html
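
As a sketch of point 1 applied to the quoted function: START is 0
on the first iteration, so the two ‘position’ forms are equivalent
and both ‘then’ parts can be dropped.  The name
‘string-split-quoted’ is made up here, only to keep this apart
from the versions further down.  The behavior, including the empty
strings for leading and trailing separators, should be unchanged.

```lisp
(defun string-split-quoted (str &optional (separator #\Space))
  "Split STR at each SEPARATOR, keeping empty substrings."
  (declare (type string str)
           (type character separator))
  (loop for start = 0 then (1+ end)
        ;; No THEN form: each of these is re-evaluated on every iteration.
        for end = (position separator str :start start)
        for substr = (subseq str start end)
        collect substr
        ;; Stop once no further separator was found.
        until (null end)))

(string-split-quoted " a b ")
;;=> ("" "a" "b" "")
```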

2. ‘position’ returns NIL if the item is not found in the
sequence, so it is necessary to check the result of a call
to this function before using it with arithmetic functions.
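
A minimal sketch of point 2.  The helper name
‘index-after-separator’ is hypothetical and exists only for this
illustration; it guards the result of ‘position’ before passing it
to ‘1+’.

```lisp
;; POSITION yields an index when the item is present, NIL otherwise.
(position #\Space "ab cd")   ;=> 2
(position #\Space "abcd")    ;=> NIL

;; Hypothetical helper: index just past the next SEPARATOR, or NIL.
(defun index-after-separator (str separator &optional (start 0))
  (let ((pos (position separator str :start start)))
    ;; Only do arithmetic when POSITION actually found something.
    (and pos (1+ pos))))

(index-after-separator "ab cd" #\Space)
;;=> 3

(index-after-separator "abcd" #\Space)
;;=> NIL
```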

Here is an alternative version of the function
‘string-split’:

(defun string-split (str &optional (separator #\Space))
  "Splits the string STR at each run of SEPARATOR characters.
The resulting non-empty substrings are collected into a list which is
returned.  SEPARATOR characters at the beginning or at the end of the
string STR are skipped, so the list never contains empty strings; a
string consisting only of SEPARATOR characters yields NIL."
  (declare (type string str)
           (type character separator))
  (loop
    with end = 0
    for start = (position separator str
                          :start end :test (complement #'char=))
    if start
      do (setf end (or (position separator str :start start)
                       (length str)))
      and collect (subseq str start end)
    until (null start)))

Some tests:

(string-split "    ")
;;=> NIL

(string-split "stringtolist")
;;=> ("stringtolist")

(string-split "stringtolist   ")
;;=> ("stringtolist")

(string-split "    stringtolist")
;;=> ("stringtolist")

(string-split "  stringtolist  ")
;;=> ("stringtolist")

(string-split "   string  to   list   ")
;;=> ("string" "to" "list")

(string-split "   string  to   list")
;;=> ("string" "to" "list")

(string-split "string  to   list   ")
;;=> ("string" "to" "list")

>
> Gauche Scheme
>
> "!" is similar to "do".
>
> (define (tokenize str separators)
>   (let ((seps (string->list separators)))
>     (! (ch :in (reverse (cons (car seps) (string->list str)))
>         := sep (member ch seps)
>         r cons (list->string tmp) :if (and (pair? tmp) sep)
>         tmp '() (if sep '() (cons ch tmp)))
>       #f r)))
>
> (tokenize "  foo; bar, baz, and ... zap" " ,;.")
>   ===>
> ("foo" "bar" "baz" "and" "zap")
>

3. Here is a generalized version of ‘string-split’,
supporting a string of separators as an argument and using
Common Lisp’s ‘position-if’/‘position-if-not’ functions in
place of ‘position’:

```
(defun string-split (str &optional (separators " ;,."))
  "Splits the string STR at each run of characters found in SEPARATORS.
The resulting non-empty substrings are collected into a list which is
returned.  Separator characters at the beginning or at the end of the
string STR are skipped, so the list never contains empty strings; a
string consisting only of separator characters yields NIL."
  (declare (type string str)
           (type string separators))
  (loop
    with bag = (coerce separators 'list) and end = 0
    for start = (position-if-not (lambda (ch) (member ch bag)) str
                                 :start end)
    if start
      do (setf end (or (position-if (lambda (ch) (member ch bag)) str
                                    :start start)
                       (length str)))
      and collect (subseq str start end)
    until (null start)))
```

Some tests:

(string-split "    " "- ")
;;=> NIL

(string-split "    " "-;")
;;=> ("    ")

(string-split "string-to-list" "- ")
;;=> ("string" "to" "list")

(string-split "string-to-list   " "- ")
;;=> ("string" "to" "list")

(string-split "    string-to-list" "- ")
;;=> ("string" "to" "list")

(string-split "  string-to-list  " "- ")
;;=> ("string" "to" "list")

(string-split "   string-to-list   " "- ")
;;=> ("string" "to" "list")

(string-split "   string-to-list" "- ")
;;=> ("string" "to" "list")

(string-split "string-to-list   " "- ")
;;=> ("string" "to" "list")
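
A possible variant, offered only as a sketch: since a string is
itself a sequence, ‘find’ can search SEPARATORS directly, which
avoids coercing it to a list.  The name ‘split-on-any’ and the
local function ‘separator-p’ are made up for this example.

```lisp
(defun split-on-any (str &optional (separators " ;,."))
  "Return the non-empty substrings of STR delimited by SEPARATORS."
  (declare (type string str separators))
  (flet ((separator-p (ch)
           ;; FIND returns the matching character, or NIL if none.
           (find ch separators :test #'char=)))
    (loop with end = 0
          ;; Skip any run of separators to find the next token start.
          for start = (position-if-not #'separator-p str :start end)
          while start
          ;; The token ends at the next separator or at the end of STR.
          do (setf end (or (position-if #'separator-p str :start start)
                           (length str)))
          collect (subseq str start end))))

(split-on-any "  foo; bar, baz, and ... zap" " ,;.")
;;=> ("foo" "bar" "baz" "and" "zap")
```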

-- 
The lyf so short, the craft so long to lerne.
- Geoffrey Chaucer, The Parliament of Birds.

Thread

Re: Tokenizer (Re: Most efficient way to read words from string.) "B. Pym" <Nobody447095@here-nor-there.org> - 2025-08-26 09:04 +0000
  Re: Tokenizer (Re: Most efficient way to read words from string.) tpeplt <tpeplt@gmail.com> - 2025-08-26 15:02 -0400
