XML Parser

provides: x-ozlib://duchier/xml/Parser.ozf; x-ozlib://duchier/xml/Parser-0.2.1.ozf

The earlier implementation of this module is still available as Parser-0.2.1.ozf and is documented by index-0.2.1.html.

Introduction

The Parser module implements a namespace-aware object-oriented XML parser. It understands just enough of the optional DOCTYPE declaration to respect ENTITY declarations. For example, if you place the following entity declarations in your document's DOCTYPE:

<!ENTITY section1 SYSTEM "foo/baz.xml">
<!ENTITY w3 "http://www.w3.org">

then any occurrence of entity reference &section1; causes the contents of file foo/baz.xml to be included, and any occurrence of entity reference &w3; is expanded into http://www.w3.org. The parser is also able to strip whitespace text nodes on the fly according to a user specification, in the style of XSLT, encapsulated in a SpaceManager.

Exports

The Parser module exports the following features:

spaceManager: a class implementing the decision rules for stripping whitespace
parser: a class implementing the XML parser
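
As a minimal sketch, the two exported classes could be obtained inside a functor as shown below. The module variable XMLParser is a name chosen here for illustration; only the URL and the features parser and spaceManager come from this page.

```oz
functor
import
   % XMLParser is a hypothetical local name for the imported module
   XMLParser at 'x-ozlib://duchier/xml/Parser.ozf'
define
   % Bind the exported classes to the names used in the rest of this page
   Parser       = XMLParser.parser
   SpaceManager = XMLParser.spaceManager
end
```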

Space Manager

A SpaceManager can answer the question: "should isolated text nodes consisting of only whitespace characters be discarded in this context?" The question must be parametrized by a URI (for the namespace) and a Tag (for the name of the element in which the whitespace text node occurs):

askStripSpace(+URI +Tag ?Bool)
askPreserveSpace(+URI +Tag ?Bool)

The default answer is no. Additional rules can be stated using the following methods:

stripSpace(+URI +Tag)
preserveSpace(+URI +Tag)

stripSpace(URI Tag) states that isolated text nodes consisting only of whitespace characters must be discarded when they occur as children of an element named Tag in namespace URI. For an element not in any namespace (this is different from an element in the default namespace) URI=unit.

It is possible to use wildcards '*' for either or both of the arguments. Thus stripSpace(URI '*') says that isolated whitespace text nodes occurring in any element of the URI namespace should be discarded. Additional rules can overrule in more specific cases. For example:


stripSpace(URI '*')
preserveSpace(URI code)

states that isolated whitespace text nodes should be discarded in all elements of namespace URI, except for element code in which they should be preserved. Which rule takes precedence? Here is the hierarchy:

Low	:	`('*' '*')`
Medium	:	`(URI '*') ('*' Tag)`
High	:	`(URI Tag)`

There can be an ambiguity on Medium, when a rule with a wildcard tag and a rule with a wildcard URI both apply and disagree. In that case an error is raised when the question is asked.

When asking questions, i.e. with the methods askStripSpace and askPreserveSpace, no wildcard can be used, of course.
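
The rules and questions above can be sketched as follows for the interactive environment (OPI). The namespace URI and the element names para and code are invented for illustration, and linking the module with Module.link is just one possible way to obtain it.

```oz
declare
[XMLParser] = {Module.link ['x-ozlib://duchier/xml/Parser.ozf']}
NS = 'http://example.org/ns'   % hypothetical namespace URI
M  = {New XMLParser.spaceManager init}
in
{M stripSpace(NS '*')}         % general rule: strip in every element of NS
{M preserveSpace(NS code)}     % more specific rule for element code
{Show {M askStripSpace(NS para $)}}    % covered only by the general rule
{Show {M askStripSpace(NS code $)}}    % covered by the more specific rule
{Show {M askStripSpace(unit para $)}}  % element in no namespace, no rule stated
```

If the rules behave as described above, the first question should be answered with true and the other two with false.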

Parser

A Parser object can be used to parse XML documents and obtain their Oz representation. The Parser class can be subclassed to provide problem-specific methods for constructing the document representation. Of course, reasonable default implementations are provided.

init

initialization method used when creating an instance: ParserInstance = {New Parser init}

parseVS(+VS ?Doc)

parseFile(+Filename ?Doc)

parseURL(+URL ?Doc)

methods for parsing an XML document provided, respectively, as a virtual string, in a file, and at a URL. Each returns the constructed document representation in its second argument.

@keepComments

setKeepComments(+Bool)

attribute and method for setting it. It controls whether comment nodes are kept (or automatically discarded). The default value is false, i.e. they are discarded.

@keepNamespaceDeclarations

setKeepNamespaceDeclarations(+Bool)

attribute and method for setting it. It controls whether namespace declarations are kept (or by default discarded). The default value is false.

setSpaceManager(+M)

installs a SpaceManager to control the parser's behaviour with respect to isolated whitespace text nodes.
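
A minimal sketch of how the configuration and parsing methods described so far might be combined in the OPI. The sample document is invented, and the exact shape of the resulting Doc is not specified on this page, so it is only displayed.

```oz
declare
[XMLParser] = {Module.link ['x-ozlib://duchier/xml/Parser.ozf']}
P = {New XMLParser.parser init}
M = {New XMLParser.spaceManager init}
Doc
in
{M stripSpace('*' '*')}      % discard all isolated whitespace text nodes
{P setSpaceManager(M)}
{P setKeepComments(true)}    % keep comment nodes (discarded by default)
% Hypothetical sample input, given as a virtual string
{P parseVS("<doc><!-- note --><item id=\"1\">hello</item></doc>" Doc)}
{Browse Doc}
```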

onStartDocument()

onEndDocument()

invoked respectively at the start and end of the document

onStartElement(Tag Alist Children)

invoked on the start tag of an element. It is its responsibility to construct a representation of the element and to contribute it to the list of items currently being accumulated (by invoking the append(_) method). Its default definition is:

meth onStartElement(Tag Alist Children)
   {self append(
            element(
               uri        : Tag.uri
               name       : Tag.name
               attributes : Alist
               children   : Children))}
end

Tag is a record that describes the start tag and has the following features:

qname: the tag's fullname as it appears in the document
prefix: just its prefix (unit if none)
uri: the namespace uri bound to the prefix (unit if none)
name: the localname of the tag (i.e. minus prefix)
coord: the debug coordinates where the start tag occurred
endCoord: the, as yet uninstantiated, debug coordinates where the corresponding end tag occurs

qname, prefix, uri and name are all atoms. Debug coordinates are records of the form: coord(Filename LineNumber). Alist is the list of accumulated attributes and possibly namespace declarations. Children is the, as yet uninstantiated, list of accumulated children of this element.

Material contributed with append by onStartElement and onEndElement is added to the content list of the element's parent. See onStartChildren/onEndChildren for similar functionality that adds to the element's own content list.

append(X)

contribute the item X to the contents list being accumulated for the current element.

onEndElement(Tag)

invoked on an end tag

onStartChildren(Tag)

onEndChildren(Tag)

invoked respectively just before and just after processing the children
of an element.  Material contributed at these points is added to the element's
content list.

onAttribute(Tag Value)

invoked for each attribute of an element.  It is its
responsibility to construct a representation of the attribute and to
contribute it to the list of attributes currently being accumulated
(by invoking the attributeAppend(_) method).  Of course,
attributes can be ignored by not contributing them.  The default definition is:

meth onAttribute(Tag Value)
   {self attributeAppend(
            attribute(
               uri   : Tag.uri
               name  : Tag.name
               value : Value))}
end

Tag is a record describing the attribute's name and
has features qname, prefix,
name, uri, coord, with the same
interpretation as for elements.  Note that attributes without an
explicit namespace prefix are always considered to be in no namespace, and not
in the default namespace (if any).

It should be noted that the attributes of an element are processed
before its onStartElement is called.  The reason for this
is that it is necessary to process all namespace declarations before
attempting to interpret the tag.

onNamespaceDeclaration(Prefix URI Coord)

some attributes are really namespace declarations.  They are
identified by their xmlns prefix (in any capitalization,
possibly mixed).
Prefix and URI are both atoms,
Coord is a debug coordinates record.
The default implementation is similar to onAttribute's but
additionally checks
@keepNamespaceDeclarations and contributes only if it is
true.

onProcessingInstruction(Name Data Coord)

invoked on processing instructions. Name is an atom, Data is a string,
Coord is a debug coordinates record.

onCharacters(Data Coord)

invoked for text nodes. Data is a string,
Coord is a debug coordinates record.

onComment(Data Coord)

invoked on comment nodes. Data is a string,
Coord is a debug coordinates record.  Note that comment
nodes are automatically discarded if @keepComments is
false.
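
As a hedged illustration of overriding these callbacks, the following sketch wraps text and comment nodes in labelled records and drops processing instructions by not contributing anything. The class name TaggedParser and the record labels text and comment are invented here; they are not the representations used by the default implementations, which this page does not specify.

```oz
declare
[XMLParser] = {Module.link ['x-ozlib://duchier/xml/Parser.ozf']}
Parser = XMLParser.parser
class TaggedParser from Parser
   meth onCharacters(Data _)
      % Data is a string; wrap it in a hypothetical text(...) record
      {self append(text(Data))}
   end
   meth onComment(Data _)
      % only reached when @keepComments is true
      {self append(comment(Data))}
   end
   meth onProcessingInstruction(_ _ _)
      % contribute nothing, so processing instructions are dropped
      skip
   end
end
```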



Example

Here is an example that subclasses the Parser.  Each instance
of MyParser is given a SpaceManager that ignores
all isolated whitespace nodes.  Namespaces are ignored (this is for a trivial
application where we assume that namespaces are not used).  Each element
is converted into a record whose label is the element's local name, and
whose two features are: alist, a record
whose features are the attributes, and children, a list of
the child elements.

class MyParser from Parser
   meth init
      M = {New SpaceManager init}
   in
      {M stripSpace('*' '*')}
      Parser,init
      {self setSpaceManager(M)}
   end
   meth onAttribute(Tag Value)
      {self attributeAppend(Tag.name#Value)}
   end
   meth onStartElement(Tag Alist Children)
      Name = Tag.name
   in
      {self append(
               Name(
                  alist    : {List.toRecord alist Alist}
                  children : Children))}
   end
end
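
A possible way to try the class, sketched under the assumption that the definition above has been fed in the same OPI session with Parser and SpaceManager bound to the module's parser and spaceManager classes. The sample document is invented for illustration.

```oz
declare
P   = {New MyParser init}
% Hypothetical input; whitespace between elements is stripped by the SpaceManager
Doc = {P parseVS("<a x=\"1\">\n  <b y=\"2\"/>\n</a>" $)}
in
{Browse Doc}
```

With the methods defined above, the element a should be represented by a record labelled a with the features alist and children.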




Denys Duchier