XML Parser

Denys Duchier

provides
x-ozlib://duchier/xml/Parser.ozf
x-ozlib://duchier/xml/Parser-0.2.1.ozf

The earlier implementation of this module is still available as Parser-0.2.1.ozf and is documented by index-0.2.1.html.

Introduction

The Parser module implements a namespace-aware object-oriented XML parser. It understands just enough of the optional DOCTYPE declaration to respect ENTITY declarations. For example, if you place the following entity declarations in your document's DOCTYPE:

<!ENTITY section1 SYSTEM "foo/baz.xml">
<!ENTITY w3 "http://www.w3.org">

then any occurrence of the entity reference &section1; causes the contents of file foo/baz.xml to be included, and any occurrence of the entity reference &w3; is expanded into http://www.w3.org. The parser can also strip whitespace text nodes on the fly, according to a user specification in the style of XSLT that is encapsulated in a SpaceManager.
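
As a rough illustration of the entity expansion described above, the following sketch parses a small document whose internal DOCTYPE subset declares the w3 entity. It assumes the package is installed and is run from the OPI; the variable names XMLParser, P, Src and Doc and the sample document are made up for this example, and the parser class and its parseVS method are the ones documented further down.

```oz
declare
% Load the module; XMLParser is a local name chosen for this sketch
[XMLParser] = {Module.link ['x-ozlib://duchier/xml/Parser.ozf']}
P = {New XMLParser.parser init}
% Hypothetical document: the DOCTYPE declares the w3 entity from the text above
Src = "<!DOCTYPE doc [ <!ENTITY w3 \"http://www.w3.org\"> ]>"#
      "<doc>See &w3; for details.</doc>"
Doc = {P parseVS(Src $)}
% Inspect the result; the text node should contain the expanded URL
{Browse Doc}
```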

Exports

The Parser module exports the following features:

spaceManager
a class implementing the decision rules for stripping whitespace
parser
a class implementing the XML parser
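
A minimal sketch of how the two exported features might be picked up inside a functor, assuming the package is installed so that the x-ozlib URL resolves. The local names XMLParser, Parser and SpaceManager are choices made for this example, not names imposed by the module.

```oz
functor
import
   % XMLParser is a local name chosen here for the imported module
   XMLParser at 'x-ozlib://duchier/xml/Parser.ozf'
define
   % Bind the two exported classes to convenient names
   Parser       = XMLParser.parser
   SpaceManager = XMLParser.spaceManager
end
```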

Space Manager

A SpaceManager can answer the question: "should isolated text nodes consisting of only whitespace characters be discarded in this context?" The question must be parametrized by a URI (for the namespace) and a Tag (for the name of the element in which the whitespace text node occurs):

askStripSpace(+URI +Tag ?Bool)
askPreserveSpace(+URI +Tag ?Bool)

The default answer is no. Additional rules can be stated using the following methods:

stripSpace(+URI +Tag)
preserveSpace(+URI +Tag)

stripSpace(URI Tag) states that isolated text nodes consisting only of whitespace characters must be discarded when they occur as children of an element named Tag in namespace URI. For an element not in any namespace (this is different from an element in the default namespace) URI=unit.

It is possible to use the wildcard '*' for either or both of the arguments. Thus stripSpace(URI '*') says that isolated whitespace text nodes occurring in any element of the URI namespace should be discarded. Additional rules can override this in more specific cases. For example:

stripSpace(URI '*') preserveSpace(URI code)

states that isolated whitespace text nodes should be discarded in all elements of namespace URI, except for element code in which they should be preserved. Which rule takes precedence? Here is the hierarchy:

Low  :  ('*' '*')
Medium  :  (URI '*') ('*' Tag)
High  :  (URI Tag)

Two Medium rules can conflict, for example stripSpace(URI '*') together with preserveSpace('*' Tag) for an element Tag in namespace URI. In that case an error is raised when the question is asked.

When asking questions, i.e. with the methods askStripSpace and askPreserveSpace, no wildcard can be used, of course.
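
The rules and questions above could be exercised as follows. This is only a sketch for the OPI: the namespace URI is hypothetical, it is assumed that URIs are passed as atoms, and the init method is the one used for SpaceManager in the Example section.

```oz
declare
[XMLParser] = {Module.link ['x-ozlib://duchier/xml/Parser.ozf']}
M   = {New XMLParser.spaceManager init}
URI = 'http://example.org/ns'    % hypothetical namespace, assumed to be an atom
% Medium-priority rule: strip whitespace nodes in every element of URI
{M stripSpace(URI '*')}
% High-priority rule: but preserve them inside <code> elements
{M preserveSpace(URI code)}
% Questions take a concrete URI and Tag, never a wildcard
{Show {M askStripSpace(URI para $)}}
{Show {M askStripSpace(URI code $)}}
{Show {M askPreserveSpace(URI code $)}}
```

Following the precedence table, the first question should be answered by the (URI '*') rule and the other two by the more specific (URI code) rule.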

Parser

A Parser object can be used to parse XML documents and obtain their Oz representation. The Parser class can be subclassed to provide problem-specific methods for constructing the document representation. Of course, reasonable default implementations are provided.

init
initialization method used when creating an instance: ParserInstance = {New Parser init}

parseVS(+VS ?Doc)
parseFile(+Filename ?Doc)
parseURL(+URL ?Doc)
methods for parsing an XML document provided, respectively, as a virtual string, in a file, and at a URL. Each returns the constructed document representation in its second argument.

@keepComments
setKeepComments(+Bool)
attribute and method for setting it. It controls whether comment nodes are kept (or automatically discarded). The default value is false, i.e. they are discarded.

@keepNamespaceDeclarations
setKeepNamespaceDeclarations(+Bool)
attribute and method for setting it. It controls whether namespace declarations are kept (or by default discarded). The default value is false.

setSpaceManager(+M)
installs a SpaceManager to control the parser's behaviour with respect to isolated whitespace text nodes.
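
Put together, the methods above might be used like this from the OPI. This is a sketch under assumptions: the package is installed, the variable names are arbitrary, and the sample document is invented for the illustration.

```oz
declare
[XMLParser] = {Module.link ['x-ozlib://duchier/xml/Parser.ozf']}
P = {New XMLParser.parser init}
M = {New XMLParser.spaceManager init}
% Discard isolated whitespace text nodes everywhere
{M stripSpace('*' '*')}
{P setSpaceManager(M)}
% Keep comment nodes instead of discarding them (the default is false)
{P setKeepComments(true)}
% Parse a document given as a virtual string
Doc = {P parseVS("<a x='1'> <!-- note --> <b/> </a>" $)}
{Browse Doc}
```

parseFile and parseURL are called the same way, with a file name or a URL in place of the virtual string.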

onStartDocument()
onEndDocument()
invoked respectively at the start and end of the document

onStartElement(Tag Alist Children)
invoked on the start tag of an element. It is its responsibility to construct a representation of the element and to contribute it to the list of items currently being accumulated (by invoking the append(_) method). Its default definition is:
meth onStartElement(Tag Alist Children)
   {self append(
            element(
               uri        : Tag.uri
               name       : Tag.name
               attributes : Alist
               children   : Children))}
end
Tag is a record that describes the start tag and has the features qname, prefix, name, uri and coord.

qname, prefix, uri and name are all atoms; coord holds the debug coordinates, a record of the form coord(Filename LineNumber). Alist is the list of accumulated attributes and possibly namespace declarations. Children is the as yet uninstantiated list of accumulated children of this element.

Material contributed with append by onStartElement and onEndElement is added to the content list of the element's parent. See onStartChildren/onEndChildren for similar functionality that adds to the element's own content list.

append(X)
contribute the item X to the contents list being accumulated for the current element.

onEndElement(Tag)
invoked on an end tag

onStartChildren(Tag)
onEndChildren(Tag)
invoked respectively just before and just after processing the children of an element. Material contributed at these points is added to the element's content list.

onAttribute(Tag Value)
invoked for each attribute of an element. It is its responsibility to construct a representation of the attribute and to contribute it to the list of attributes currently being accumulated (by invoking the attributeAppend(_) method). Of course, attributes can be ignored by not contributing them. The default definition is:
meth onAttribute(Tag Value)
   {self attributeAppend(
            attribute(
               uri   : Tag.uri
               name  : Tag.name
               value : Value))}
end

Tag is a record describing the attribute's name and has features qname, prefix, name, uri, coord, with the same interpretation as for elements. Note that attributes without an explicit namespace prefix are always considered to be in no namespace, and not in the default namespace (if any).

It should be noted that the attributes of an element are processed before its onStartElement is called. The reason for this is that it is necessary to process all namespace declarations before attempting to interpret the tag.

onNamespaceDeclaration(Prefix URI Coord)
some attributes are really namespace declarations; they are identified by their xmlns prefix (in any capitalization, including mixed case). Prefix and URI are both atoms; Coord is a debug coordinates record. The default implementation is similar to onAttribute's but additionally checks @keepNamespaceDeclarations and contributes only if it is true.

onProcessingInstruction(Name Data Coord)
invoked on processing instructions. Name is an atom, Data is a string, Coord is a debug coordinates record

onCharacters(Data Coord)
invoked for text nodes. Data is a string, Coord is a debug coordinates record

onComment(Data Coord)
invoked on comment nodes. Data is a string, Coord is a debug coordinates record. Note that comment nodes are automatically discarded if @keepComments is false
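
As an illustration of overriding these handlers, the following sketch defines a subclass that records text and comment nodes together with their debug coordinates. The class name AnnotatingParser and the record labels text and comment are hypothetical choices for this example; they are not the representation used by the default implementation.

```oz
declare
[XMLParser] = {Module.link ['x-ozlib://duchier/xml/Parser.ozf']}
Parser = XMLParser.parser
class AnnotatingParser from Parser
   meth init
      Parser,init
      % Comments are discarded unless this is set to true
      {self setKeepComments(true)}
   end
   meth onCharacters(Data Coord)
      % Contribute a (hypothetical) text record to the current contents list
      {self append(text(data:Data coord:Coord))}
   end
   meth onComment(Data Coord)
      % Contribute a (hypothetical) comment record in the same way
      {self append(comment(data:Data coord:Coord))}
   end
end
```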

Example

Here is an example that subclasses the Parser. Each instance of MyParser is given a SpaceManager that discards all isolated whitespace nodes. Namespaces are ignored (this is for a trivial application where we assume that namespaces are not used). Each element is converted into a record whose label is the element's local name and whose two features are alist, a record whose features are the attributes, and children, a list of the element's children.
class MyParser from Parser
   meth init
      M = {New SpaceManager init}
   in
      {M stripSpace('*' '*')}
      Parser,init
      {self setSpaceManager(M)}
   end
   meth onAttribute(Tag Value)
      {self attributeAppend(Tag.name#Value)}
   end
   meth onStartElement(Tag Alist Children)
      Name = Tag.name
   in
      {self append(
               Name(
                  alist    : {List.toRecord alist Alist}
                  children : Children))}
   end
end
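
A possible way to try MyParser from the OPI is sketched below. It assumes that Parser and SpaceManager are bound to the module's exported classes and that the class definition above has already been fed; the sample document is invented for the illustration.

```oz
% Assumes MyParser (defined above) is already in scope
declare
P   = {New MyParser init}
Doc = {P parseVS("<greeting lang='en'> <to>world</to> </greeting>" $)}
% The greeting element should show up as a record labelled greeting,
% with features alist and children, as built by onStartElement above
{Browse Doc}
```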
