From: tpeplt
Newsgroups: comp.lang.lisp,comp.lang.scheme
Subject: Re: Tokenizer (Re: Most efficient way to read words from string.)
Date: Tue, 26 Aug 2025 15:02:11 -0400
Message-ID: <87ldn5zz2k.fsf@gmail.com>
References: <108jta2$3uj44$1@dont-email.me>

>> (defun string-split (str &optional (separator #\Space))
>>   "Splits the string STR at each SEPARATOR character occurrence.
>> The resulting substrings are collected into a list which is returned.
>> A SEPARATOR at the beginning or at the end of the string STR results
>> in an empty string in the first or last position of the list
>> returned."
>>   (declare (type string str)
>>            (type character separator))
>>   (loop for start = 0 then (1+ end)
>>         for end = (position separator str :start 0)
>>           then (position separator str :start start)
>>         for substr = (subseq str start end)
>>           then (subseq str start end)
>>         collect substr into result
>>         when (null end) do (return result)
>>         ))
>

1. With a for-as-equals-then clause, if the ‘then’ FORM2 is omitted,
   FORM1 is evaluated on each iteration.  So, if FORM2 is identical to
   FORM1, it may be omitted.  See:

   http://www.ai.mit.edu/projects/iiip/doc/CommonLISP/HyperSpec/Body/sec_6-1-2-1-4.html

2. ‘position’ returns NIL if the item is not found in the sequence, so
   it is necessary to check the result of a call to this function
   before using it with arithmetic functions.

Here is an alternative version of the function ‘string-split’.  Unlike
the quoted original, it skips leading, trailing, and repeated
separators, so it never returns empty strings:

(defun string-split (str &optional (separator #\Space))
  "Splits the string STR at each run of SEPARATOR characters.
The resulting non-empty substrings are collected into a list which is
returned.  SEPARATOR characters at the beginning or at the end of the
string STR are skipped, so the list returned never contains an empty
string."
  (declare (type string str)
           (type character separator))
  (loop with end = 0
        ;; START is the next non-separator character at or after END.
        for start = (position separator str :start end :test-not #'equal)
        if start
          ;; END is the next separator, or the end of STR if none is left.
          do (setf end (or (position separator str :start start)
                           (length str)))
          and collect (subseq str start end)
        until (null start)))

Some tests:

(string-split " ")
;;=> NIL
(string-split "stringtolist")
;;=> ("stringtolist")
(string-split "stringtolist ")
;;=> ("stringtolist")
(string-split " stringtolist")
;;=> ("stringtolist")
(string-split " stringtolist ")
;;=> ("stringtolist")
(string-split " string to list ")
;;=> ("string" "to" "list")
(string-split " string to list")
;;=> ("string" "to" "list")
(string-split "string to list ")
;;=> ("string" "to" "list")

>
> Gauche Scheme
>
> "!" is similar to "do".
>
> (define (tokenize str separators)
>   (let ((seps (string->list separators)))
>     (! (ch :in (reverse (cons (car seps) (string->list str)))
>         := sep (member ch seps)
>         r cons (list->string tmp) :if (and (pair? tmp) sep)
>         tmp '() (if sep '() (cons ch tmp)))
>       #f r)))
>
> (tokenize " foo; bar, baz, and ... zap" " ,;.")
>   ===>
> ("foo" "bar" "baz" "and" "zap")
>

3. Here is a generalized version of ‘string-split’, supporting a string
   of separators as an argument and using Common Lisp’s
   ‘position-if’/‘position-if-not’ functions in place of ‘position’:

```
(defun string-split (str &optional (separators " ;,."))
  "Splits the string STR at each run of characters found in SEPARATORS.
The resulting non-empty substrings are collected into a list which is
returned.  Separator characters at the beginning or at the end of the
string STR are skipped, so the list returned never contains an empty
string."
  (declare (type string str)
           (type string separators))
  (loop with bag = (coerce separators 'list)
        and end = 0
        ;; START is the next character at or after END that is not in BAG.
        for start = (position-if-not (lambda (ch) (member ch bag))
                                     str :start end)
        if start
          ;; END is the next character in BAG, or the end of STR.
          do (setf end (or (position-if (lambda (ch) (member ch bag))
                                        str :start start)
                           (length str)))
          and collect (subseq str start end)
        until (null start)))
```

Some tests:

(string-split " " "- ")
;;=> NIL
(string-split " " "-;")
;;=> (" ")
(string-split "string-to-list" "- ")
;;=> ("string" "to" "list")
(string-split "string-to-list " "- ")
;;=> ("string" "to" "list")
(string-split " string-to-list" "- ")
;;=> ("string" "to" "list")
(string-split " string-to-list " "- ")
;;=> ("string" "to" "list")
(string-split " string-to-list " "- ")
;;=> ("string" "to" "list")
(string-split " string-to-list" "- ")
;;=> ("string" "to" "list")
(string-split "string-to-list " "- ")
;;=> ("string" "to" "list")

-- 
The lyf so short, the craft so long to lerne.
- Geoffrey Chaucer, The Parliament of Birds.
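
As a further illustration of points 1 and 2, here is a minimal sketch
(not code from the thread) that applies them to the quoted original
while keeping the empty-string behaviour its docstring describes.  The
name ‘string-split-keep-empty’ is hypothetical.  The ‘for end =’ clause
has no ‘then’ part, so its form is re-evaluated on every iteration, and
‘while end’ stops the loop before (1+ end) could be evaluated with END
being NIL.

```lisp
;; A sketch under stated assumptions; the function name is hypothetical.
(defun string-split-keep-empty (str &optional (separator #\Space))
  "Splits STR at each SEPARATOR, keeping empty strings for leading,
trailing, and adjacent separators."
  (declare (type string str)
           (type character separator))
  (loop for start = 0 then (1+ end)
        ;; Point 1: no THEN form, so this form runs on every iteration.
        for end = (position separator str :start start)
        ;; SUBSEQ treats an END of NIL as "up to the end of STR".
        collect (subseq str start end)
        ;; Point 2: END is NIL once no separator remains; stop here so
        ;; that (1+ end) is never evaluated with NIL.
        while end))

;; Usage (note the two adjacent spaces between "a" and "b"):
;; (string-split-keep-empty " a  b ")
;; ;;=> ("" "a" "" "b" "")
```

Whether empty strings should be kept or dropped depends on the caller;
the versions above drop them, this sketch keeps them.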