Subject: Re: URI parsing
From: rpw3@rigden.engr.sgi.com (Rob Warnock)
Date: 6 Apr 2001 13:23:10 GMT
Newsgroups: comp.lang.lisp
Message-ID: <9akfvu$818lf$1@fido.engr.sgi.com>

Jochen Schmidt <jsc@dataheaven.de> wrote:
+---------------
| Raymond Wiker wrote:
| >         I wrote a small (37 lines) function yesterday for parsing
| > URIs. When I compared it with NET.URI, I noticed that I didn't handle
| > fragments (internal anchors in html files). On the other hand, it
| > *does* handle usernames and passwords.
| 
| Thanks - my new parsing function for NET.URI handles
| scheme, authority, path, query and fragment parts. It is only 22 lines
| long. I didn't think it would be such a good idea to bring in more
| specialized fields, since after the scheme each URI is free to define its
| own syntax.
+---------------

But at least for the "Common Internet Scheme Syntax" [RFC 1738 Section 3.1],
that is, anything that starts with "//" after the scheme (including the
"http:", "ftp:", and "telnet:" schemes), you definitely should parse *all*
of the elements (if present):

    //<user>:<password>@<host>:<port>/<url-path>
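
As a rough illustration of splitting out every element of that template,
here is a minimal sketch in Python. The function name and the dict keys are
made up for this example, and it assumes the scheme and its ":" have already
been stripped off. It splits at the last "@", on the assumption that any
"@", ":" or "/" inside the user or password has been encoded as RFC 1738
requires:

```python
def parse_common_internet_syntax(scheme_specific_part):
    """Split '//<user>:<password>@<host>:<port>/<url-path>' into a dict.

    Absent elements come back as None. Hypothetical helper, for
    illustration only.
    """
    if not scheme_specific_part.startswith("//"):
        raise ValueError("expected '//' after the scheme")
    rest = scheme_specific_part[2:]

    # The login/host section runs up to the first "/"; the url-path follows.
    login, slash, url_path = rest.partition("/")

    user = password = None
    hostport = login
    if "@" in login:
        # Split at the LAST "@": whatever precedes it is user[:password],
        # even when it is dressed up to look like a domain name.
        userinfo, _, hostport = login.rpartition("@")
        user, colon, password = userinfo.partition(":")
        if not colon:
            password = None

    host, _, port_text = hostport.partition(":")
    return {
        "user": user,
        "password": password,
        "host": host,
        "port": int(port_text) if port_text else None,
        "url_path": url_path if slash else None,
    }
```

Calling it on "//www.example.com@192.0.2.7/login.html" reports
"www.example.com" as the user and "192.0.2.7" as the host, which is the
distinction a scheme/authority-only parser would hide.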

There have been published exploits recently that involved deceiving users
by formatting a "user" component that *looked* like a domain name but wasn't,
because of a later "@". See the following RISKS Digest articles for an
especially sneaky example:

    "Making something look hacked when it isn't"
    <URL:http://catless.ncl.ac.uk/Risks/21.16.html#subj5.1>

    "The risk of a seldom-used URL syntax"
    <URL:http://catless.ncl.ac.uk/Risks/21.16.html#subj6.1>
    <URL:http://catless.ncl.ac.uk/Risks/21.18.html#subj15.1>
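
To make the trick concrete, here is a small sketch using Python's standard
urllib.parse module. The two helper names are invented for this example,
and the "contains a dot" test is only an assumed heuristic, not a complete
defence:

```python
from urllib.parse import urlsplit


def real_host(url):
    """Return the host the client would actually connect to."""
    return urlsplit(url).hostname


def userinfo_looks_like_host(url):
    """Heuristic (an assumption of this sketch): flag a URL whose
    "user" component contains a dot and so could pass for a domain name."""
    user = urlsplit(url).username
    return user is not None and "." in user
```

For "http://www.example.com@192.0.2.7/login.html" (both names are
documentation placeholders), real_host returns "192.0.2.7" while
userinfo_looks_like_host returns True: the part a hurried reader takes for
the site is only the user name.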


-Rob

-----
Rob Warnock, 31-2-510		rpw3@sgi.com
SGI Network Engineering		<URL:http://reality.sgi.com/rpw3/>
1600 Amphitheatre Pkwy.		Phone: 650-933-1673
Mountain View, CA  94043	PP-ASEL-IA