allegro-cl archives 2003-12-16
From: Steve Haflich
Subject: Re: XML parser and line feeds between tags
Date: 2003-12-16 12:04

   From: Laurent Eschenauer <pepite.be at laurent>

   I have an issue with the xml parser in ACL 6.2 (pxml) when using line feeds. Looking at the XML specs, I understand that the XML parser should ignore line feeds and extra whitespace. However when I parse the following file with ACL 6.2 :

      <team>
      <person id="b001" name="laurent eschenauer"/>
      <person id="b002" name="cedric gauthy"/>
      </team>

   Using the command :

      (parse-xml stream :content-only t)

   I receive:

      ((team " "
        ((person id "b001" name "laurent eschenauer")) " "
        ((person id "b002" name "cedric gauthy")) " "))

   As you can see, all line feeds are handled by the parser as token while they should not be visible (according to the XML specs at http://www.xml.com/axml/testaxml.htm).

I think you are misreading the XML standard. (The annotated one you cite is the 1998 version -- the 2000 revision is available at w3c.org, but I don't think it is any different in this regard.) Section 2.10 of the standard states:

   An XML processor must always pass all characters in a document that are not markup through to the application.

That's unequivocal. It is also supported by the annotated test in the document you cite -- click on the second circle-T annotation in Section 2.10.

   A validating XML processor must also inform the application which of these characters constitute white space appearing in element content.

The ACL pxml is _not_ a validating parser, but even if it were, a validating parser is still _required_ to pass back to the client application all whitespace that is not markup. (Very informally, that includes all whitespace not inside angle brackets.) A validating parser should differentiate whitespace that appears in places where regular character data cannot appear (e.g. between an <ol> and an <li> tag) because that whitespace cannot be part of the significant document text.
However, it must still be presented to the application. The third circle-T annotation in 2.10 verifies this.

At first it is hard to see the sense of this requirement. But consider an application such as an XML editor or XSL transformer. There is no reason it shouldn't make use of the same parser as any other application, but (since XML documents are supposed to be human readable as a fallback) such an editor might want as much as possible to preserve whitespace that serves only to make the document format more readable. This is also the reason parsers are _permitted_ (but not required) to inform applications about comments, as does the SAX interface.

But it is still up to the application whether to ignore whitespace that appears in element content (i.e. where character data _may_ _not_ otherwise appear). The processor (parser) must still present it to the application.

Of course, whitespace that appears where character data _may_ appear must also be preserved. Our phtml used to screw this up: given

   <i>foo</i> <i>bar</i>

it would run the two words together. (HTML has _very_ different rules for whitespace preservation, but the issue is similar.)

   A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications.

But even this attribute does not allow the parser to suppress whitespace. It is merely a suggestion to the client application where whitespace should be considered significant, for example, to differentiate sections of prose and poetry.
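
Since the decision to ignore whitespace is left to the application, one way to get the result originally expected is to filter the parse result afterwards. The following is only a minimal sketch in portable Common Lisp. It assumes the list shape shown in the output quoted above, where an element is a list whose first item is the element name (or a list of the name followed by attributes) and whose remaining items are the content. The names whitespace-string-p and strip-whitespace are hypothetical helpers and are not part of pxml.

```lisp
;; Hypothetical helpers, not part of pxml. They assume each element is a
;; list of the form (name-or-(name attr value ...) content-item ...).

(defun whitespace-string-p (x)
  "True if X is a string containing only space, tab, newline, or return.
The empty string also counts as whitespace here."
  (and (stringp x)
       (every (lambda (ch)
                (member ch '(#\Space #\Tab #\Newline #\Return)))
              x)))

(defun strip-whitespace (node)
  "Return a copy of NODE with whitespace-only strings removed from the
content of NODE and of every element nested inside it."
  (if (atom node)
      node                            ; a string or a bare element name
      (cons (first node)              ; keep the name and attributes untouched
            (mapcar #'strip-whitespace
                    (remove-if #'whitespace-string-p (rest node))))))

;; Usage: apply it to each top-level element of the parse result, e.g.
;;   (mapcar #'strip-whitespace (parse-xml stream :content-only t))
(strip-whitespace
 '(team " "
   ((person id "b001" name "laurent eschenauer")) " "
   ((person id "b002" name "cedric gauthy")) " "))
```

This filter is only safe when the application knows the elements in question hold element content only. Applied to mixed content such as <i>foo</i> <i>bar</i>, it would drop the significant space and run the two words together, which is exactly the mistake described above. A non-validating parser has no way to tell the two cases apart, so the choice belongs in the application.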