Re: string processor - Naggum cll archive

Subject: Re: string processor
From: Erik Naggum <erik@naggum.net>
Date: 2000/08/10
Newsgroups: comp.lang.lisp
Message-ID: <3174939482743037@naggum.net>

* fsumera@cs.rmit.edu.au (Irma)
| is there any way to seperate parts of a sentence

  There's the function position to find the position of a single
  element in a sequence, and the function search to locate a
  subsequence.  Then there's the function subseq to extract a
  subsequence.
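
  A minimal sketch of how these three fit together, assuming you only
  want to split once, at the first occurrence of a separator.  The
  name split-once is hypothetical, not a standard function.

```lisp
(defun split-once (separator target)
  "Return the parts of TARGET before and after the first SEPARATOR
as two values, or TARGET and NIL if SEPARATOR does not occur."
  (let ((pos (search separator target)))
    (if pos
        (values (subseq target 0 pos)
                (subseq target (+ pos (length separator))))
        (values target nil))))

;; position finds a single element, search finds a subsequence:
;;   (position #\: "key:value")      => 3
;;   (search ": " "key: value")      => 3
;;   (split-once ": " "key: value")  => "key", "value"
```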

| (something like stringtokenizer in java)

  Well, instead of implementing a class like StringTokenizer with
  Enumeration stuff like hasMoreElements and nextElement, the customary
  Common Lisp approach is to return a list of tokens.  Whether you
  wish to iterate over the returned sequence of tokens or do something
  else is up to you, but if so, you do it with ordinary list traversal.
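
  As a small illustration of ordinary list traversal, here is a
  sketch that walks a list of tokens with dolist.  The function name
  print-tokens is made up for the example, and the token list is
  written out by hand rather than produced by any real splitter.

```lisp
(defun print-tokens (tokens &optional (stream *standard-output*))
  "Print each token on its own line; return the number of tokens."
  (dolist (token tokens (length tokens))
    (format stream "~A~%" token)))

;; (print-tokens '("things1" "things2")) prints two lines, returns 2.
;; Building a new list from the tokens is just as ordinary:
;;   (mapcar #'length '("things1" "things2"))  => (7 7)
```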

| for example i've got the string : "things1 : things2"
| and i want to seperate them by the ':'

  There are basically two kinds of separators: Individual characters
  (such as are implemented by java.util.StringTokenizer) and strings.
  I tend to favor the strings model, as that allows more interesting
  cases.  Individual characters are also a special case of strings.

  There are also basically two ways to deal with separators: Either
  you select any from a set repeatedly (StringTokenizer default), or
  you match the separators in a list in order (nextToken with an
  argument).

  Then there are basically two ways to return tokens: Either your
  separators are noise and you discard them, or you return them as
  tokens in their own right (both are available, determinable by the
  returnTokens argument to the constructor).

  Finally, there are basically two ways to treat two separators back
  to back in the target string: Either it means there is an empty
  token between them, or there isn't (the StringTokenizer way).
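
  To make the other end of these choices concrete, here is a sketch
  of the StringTokenizer-like combination: any character from a set,
  separators discarded, with the empty-token question left as a
  keyword argument.  The name split-on-chars and the :keep-empty
  keyword are assumptions for this example.  Returning separators as
  tokens is not covered.

```lisp
(defun split-on-chars (separators target &key keep-empty)
  "Split the string TARGET at any character found in the string
SEPARATORS.  Separators are discarded.  Empty tokens are collected
only if KEEP-EMPTY is true."
  (loop with start = 0
        ;; END is the next separator at or after START, or NIL.
        for end = (position-if (lambda (c) (find c separators))
                               target :start start)
        for token = (subseq target start end)
        when (or keep-empty (plusp (length token)))
          collect token
        while end
        do (setq start (1+ end))))

;; (split-on-chars " :" "things1 : things2")  => ("things1" "things2")
;; (split-on-chars ":" "a::b")                => ("a" "b")
;; (split-on-chars ":" "a::b" :keep-empty t)  => ("a" "" "b")
```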

  Suppose we make a random sampling of decisions and choose string
  separators, matched in order, that are not significant (so they are
  discarded), but with empty tokens between back-to-back separators.
  (Symbols with <> around them would be lambda-bound.)

(loop
    for delimiter in <delimiters>
    for match-start = <start> then next
    for match-end = (search delimiter <target> :start2 match-start :end2 <end>)
    with next
    while match-end
    collect (subseq <target> match-start match-end)
    do (setq next (+ match-end (length delimiter))))

  <start> would default to 0, <end> to nil (which means the length of
  the target sequence).
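
  One way to bind the <> symbols is to wrap the loop in a function.
  This is only a sketch: the name split-in-order and its argument
  list are assumptions, and the with clause is moved to the front
  purely for readability.  Note that whatever follows the last
  matched delimiter is not collected, and that the loop stops at the
  first delimiter it fails to find.

```lisp
(defun split-in-order (target delimiters &key (start 0) end)
  "Match each element of DELIMITERS, in order, in TARGET and return
the list of subsequences that precede each match."
  (loop with next
        for delimiter in delimiters
        for match-start = start then next
        for match-end = (search delimiter target
                                :start2 match-start :end2 end)
        while match-end
        collect (subseq target match-start match-end)
        do (setq next (+ match-end (length delimiter)))))

;; (split-in-order "a=1;b=2;" '("=" ";" "=" ";"))  => ("a" "1" "b" "2")
;; (split-in-order "a==b=" '("=" "=" "="))         => ("a" "" "b")
;; (split-in-order "things1 : things2" '(" : "))   => ("things1")
```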

  Incidentally, note that none of the code above knows it's dealing
  with strings, but it would be very inefficient for lists of stuff.
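
  A short illustration of that point, using made-up data: search and
  subseq accept any sequence, so the same calls work on a vector of
  integers and on a list.  On the list, every :start2 and every
  subseq has to walk the conses from the head again, which is where
  the inefficiency comes from.

```lisp
(let ((vec #(1 2 0 3 4 0))
      (lst '(1 2 0 3 4 0)))
  (list (search #(0) vec :start2 3)    ; => 5, by direct indexing
        (subseq vec 3 5)               ; => #(3 4)
        (search '(0) lst :start2 3)    ; => 5, after skipping 3 conses
        (subseq lst 3 5)))             ; => (3 4), skipping them again
```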

  Considering the number of options, and the ease with which you can
  write a loop that collects your particular substrings, this is an
  argument for the software-pattern mode of solution, specifically
  that using the general form is more complex than using the pattern.
  This was highlighted by a long-winded discussion on how to implement
  a tokenizer some time ago -- it grew ever more powerful and harder
  to use.  Incidentally, I think java.util.StringTokenizer is silly
  both in its limitedness and its complexity of use, but studying Java
  is not a very entertaining task, except maybe if you're coming at
  Java from below, like from C++.

| I've been using the 'read-from-string' function to get one string at
| a time but i guess that must be a better way to that.

  You _really_ don't want to use the Lisp reader unless you have Lisp
  objects of some sort.  Just trust me on this for now -- there's no
  need to figure it out the hard way.
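
  For the curious, a small sketch of just one of the problems,
  assuming the standard readtable: the reader hands back Lisp
  objects, not pieces of your string.  The helper read-first-object
  is hypothetical, and it turns *read-eval* off so that #. in the
  input cannot run code.

```lisp
(defun read-first-object (string)
  "Read one object from STRING with standard reader settings."
  (with-standard-io-syntax
    (let ((*read-eval* nil))
      (read-from-string string))))

;; (read-first-object "things1 : things2")
;;   returns the symbol THINGS1, interned and upcased,
;;   not the string "things1".
;; (read-first-object "42 apples")
;;   returns the integer 42, not the string "42".
```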

#:Erik
-- 
  If this is not what you expected, please alter your expectations.