Subject: Re: Splitting a string on a character...
From: Erik Naggum <erik@naggum.net>
Date: Tue, 07 May 2002 02:40:41 GMT
Newsgroups: comp.lang.lisp
Message-ID: <3229728040064576@naggum.net>

* Cory Spencer
| Just a quickie question - is there already a Common Lisp function that
| will split a string on a given character?

Most often, when people ask quickie questions, they have been working
themselves through what one would think of as a labyrinth where they
make brief excursions in the wrong direction and self-correct when they
hit the wall, so to speak. When they hit the wall and do not
self-correct, they post a quickie question, but there is an arbitrary
amount of backtracking involved in providing the right answer. Just
moving the person into a new labyrinth without the particular wall they
have run into is seldom the best answer, as the wrong choice they have
made will lead them right into another wall shortly thereafter.
Therefore, a "quickie" is a strong signal to experienced problem-solvers
that something is wrong: The requestor is stuck, but does not think he
should have been. However, if his thinking were correct, he would not
be stuck. Yet he is, and that is a hint that the amount of backtracking
required will be significant and that is just the opposite of a
"quickie".

| ie) will perform a similar function as this:

Generally speaking, a reader or parser of some sort. It is quite
important to realize that you will never, ever have a case where you can
entirely get rid of the "splitting" character. If you think you can
legitimately expect this, you are just too inexperienced at what you are
doing and will run into a problem sooner or later. Let me give you a
few examples.

Under Unix, you cannot have a colon in your login name, in your home
directory name, in your real name, or in your shell, because the colon
separates these fields in a system password file. (Not to mention null
bytes and newlines.)
This is just too dumb to be believable on the face of it, but it is
actually the case. Unix freaks do not think this is a problem because
they internalize the rules and do not _want_ a colon in those places.
However, software that updates the password file has to do sanity checks
in order not to expose the system to serious security risks because
there is no way to escape a payload colon from the delimiting colon.

In the standard Unix shells, whitespace separates arguments, but you
have several escaping forms to allow whitespace to exist in arguments.
All in all, the mechanisms that are used in the shell are quite arcane
and difficult to predict from a program, but a user can usually deal
with it, in the standard Unix idea of "usually".

Then there is HTML and URL's and all that crap. To make sure that a
character is always a payload character, it must be written as &#nnn;,
where nnn is the ISO 10646 code for the character, or you have to engage
in table lookups, context-sensitive parsing rules, and all sorts of
random weirdness. Likewise, in URL's, it is incredibly hard to get all
you want through to the other side. Recently, I subscribed to the
Unabridged Merriam-Webster dictionary, and they need the e-mail address
as the username. It turned out to be very hard to write a URL that had
a payload @ in the username and a syntax @ before the hostname. I
actually find such things absolutely incredible -- to be so thoughtless
must have been _really_ hard.

This is why you should not use position to find a character to split on,
you should use a state machine that traverses the string and finds only
those (matching) characters that are syntactically relevant, not those
(matching) characters that are (or should be) payload characters. A
regular expression is _not_ sufficient for this task.

--
In a fight against something, the fight has value, victory has none.
In a fight for something, the fight is a loss, victory merely relief.
70 percent of American adults do not understand the scientific process.
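
Editorial sketch, not part of the original post: the state machine
recommended above could look roughly like the following. It is written
in Python rather than Common Lisp purely for illustration. The names
split_fields and join_fields, the backslash escape convention, and the
sample record are all assumptions made for this example; the post does
not prescribe any particular escape syntax.

```python
def split_fields(text, delimiter=":", escape="\\"):
    """Split text on syntactic delimiters only.

    A delimiter preceded by the escape character is payload and stays
    in the field. Both names and the escape convention are hypothetical.
    """
    fields = []
    current = []
    state = "normal"  # the two states are "normal" and "escaped"
    for ch in text:
        if state == "escaped":
            # Whatever follows the escape is payload, even a delimiter.
            current.append(ch)
            state = "normal"
        elif ch == escape:
            state = "escaped"
        elif ch == delimiter:
            # An unescaped delimiter is syntax: close the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    if state == "escaped":
        raise ValueError("dangling escape character at end of input")
    fields.append("".join(current))
    return fields


def join_fields(fields, delimiter=":", escape="\\"):
    """Inverse of split_fields: escape payload delimiters and escapes."""
    def quote(field):
        return "".join(
            escape + ch if ch in (delimiter, escape) else ch
            for ch in field
        )
    return delimiter.join(quote(field) for field in fields)


if __name__ == "__main__":
    record = ["erik", "Erik: the user", "/home/erik"]
    line = join_fields(record)
    print(line)
    print(split_fields(line) == record)
```

With this convention a payload colon survives a round trip through
join_fields and split_fields, whereas a purely positional split such as
str.split(":") would cut the second field in two at its payload colon.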