The earlier implementation of this module is still available as Parser-0.2.1.ozf
and is documented by index-0.2.1.html.
The Parser module implements a namespace-aware object-oriented XML parser.
It understands just enough of the optional DOCTYPE declaration to respect
ENTITY declarations. For example, if you place the following entity
declarations in your document's DOCTYPE:
<!ENTITY section1 SYSTEM "foo/baz.xml">
<!ENTITY w3 "http://www.w3.org">
then any occurrence of the entity reference &section1; causes the contents of
file foo/baz.xml to be included, and any occurrence of the entity reference
&w3; is expanded into http://www.w3.org. The parser is also able to strip
whitespace text nodes on the fly according to a user specification in the
style of XSLT, encapsulated in a SpaceManager.
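As a minimal sketch of the entity expansion described above, the snippet below parses a small document whose DOCTYPE declares an internal entity. It assumes that XMLParser (a hypothetical variable name) is already bound to this module, that its parser feature is the parser class, and that entity declarations in the DOCTYPE's internal subset are handled as described here. The shape of the resulting Doc is not specified, so it is only displayed.

```oz
declare Parser P Doc in
%% Assumption: XMLParser is already bound to the module documented here;
%% how it is imported depends on your installation.
Parser = XMLParser.parser
P = {New Parser init}
%% The DOCTYPE declares the internal entity w3, which is then
%% referenced in the content of the doc element.
{P parseVS(
    "<!DOCTYPE doc [ <!ENTITY w3 \"http://www.w3.org\"> ]>"#
    "<doc>&w3;</doc>"
    Doc)}
%% Display the resulting Oz representation (Browse is available in the OPI).
{Browse Doc}
```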
The Parser module exports the following features:
spaceManager
parser
A SpaceManager can answer the question: "should isolated text nodes
consisting of only whitespace characters be discarded in this context?"
The question must be parametrized by a URI (for the namespace) and a Tag
(for the name of the element in which the whitespace text node occurs):
askStripSpace(+URI +Tag ?Bool)
askPreserveSpace(+URI +Tag ?Bool)
The default answer is no. Additional rules can be stated using the following methods:
stripSpace(+URI +Tag)
preserveSpace(+URI +Tag)
stripSpace(URI Tag) states that isolated text nodes consisting only of
whitespace characters must be discarded when they occur as children of an
element named Tag in namespace URI. For an element not in any namespace
(this is different from an element in the default namespace), URI=unit.
It is possible to use the wildcard '*' for either or both of the arguments.
Thus stripSpace(URI '*') says that isolated whitespace text nodes occurring
in any element of the URI namespace should be discarded. Additional rules
can overrule this in more specific cases. For example:
stripSpace(URI '*')
preserveSpace(URI code)
states that isolated whitespace text nodes should be discarded in all
elements of namespace URI, except for element code, in which they should be
preserved. Which rule takes precedence? Here is the hierarchy:
Low    : ('*' '*')
Medium : (URI '*') ('*' Tag)
High   : (URI Tag)
There can be an ambiguity at the Medium level, in which case an error is raised when the question is asked.
When asking questions, i.e. with the methods askStripSpace and
askPreserveSpace, no wildcard can be used.
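The rules above can be sketched as follows. This is only an illustration: XMLParser is a hypothetical variable assumed to be already bound to this module, and the namespace URI is made up for the example. Only the methods named in this section are used.

```oz
declare SpaceManager M NS B1 B2 in
%% Assumption: XMLParser is already bound to the module documented here.
SpaceManager = XMLParser.spaceManager
M = {New SpaceManager init}
%% Hypothetical namespace URI, used only for illustration.
NS = 'http://example.org/ns'
%% Medium-priority rule: strip in every element of namespace NS ...
{M stripSpace(NS '*')}
%% ... except in element code, where the High-priority rule wins.
{M preserveSpace(NS code)}
%% Questions must name a concrete URI and Tag (no wildcards).
{M askStripSpace(NS para B1)}   % per the rules above, expected to be true
{M askStripSpace(NS code B2)}   % per the rules above, expected to be false
{Browse B1#B2}
```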
A Parser object can be used to parse XML documents and obtain their Oz
representation. The Parser class can be subclassed to provide
problem-specific methods for constructing the document representation.
Reasonable default implementations are provided.
init
ParserInstance = {New Parser init}
parseVS(+VS ?Doc)
parseFile(+Filename ?Doc)
parseURL(+URL ?Doc)
@keepComments
setKeepComments(+Bool)
- whether comment nodes are kept. The default is false, i.e. they are discarded.
@keepNamespaceDeclarations
setKeepNamespaceDeclarations(+Bool)
- whether namespace declarations are contributed to the attribute list. The default is false.
setSpaceManager(+M)
- M is a SpaceManager used to control the parser's behaviour with respect
to isolated whitespace text nodes.
parseVS(+VS ?Doc)
parseFile(+Filename ?Doc)
parseURL(+URL ?Doc)
onStartDocument()
onEndDocument()
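onStartElement(Tag Alist Children)
- invoked on a start tag. It is responsible for constructing a
representation of the element and contributing it to the content list
currently being accumulated (by invoking the append(_) method). Its default
definition is:
meth onStartElement(Tag Alist Children)
   {self append(
      element(
         uri        : Tag.uri
         name       : Tag.name
         attributes : Alist
         children   : Children))}
end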
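Tag is a record that describes the start tag and has the following features:
qname - the qualified name
prefix - the namespace prefix (unit if none)
uri - the namespace URI (unit if none)
name - the local name
coord - the debug coordinates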
qname, prefix, uri and name are all atoms. Debug coordinates are records of
the form: coord(Filename LineNumber). Alist is the list of accumulated
attributes and possibly namespace declarations. Children is the as yet
uninstantiated list of accumulated children of this element.
Material contributed with append by onStartElement and onEndElement is added
to the content list of the element's parent. See
onStartChildren/onEndChildren for similar functionality adding to the
element's own content list.
append(X)
- adds X to the contents list being accumulated for the current element.
onEndElement(Tag)
- invoked on an end tag
onStartChildren(Tag)
onEndChildren(Tag)
- invoked respectively just before and just after processing the children
of an element. Material contributed at these points is added to the element's
content list.
onAttribute(Tag Value)
- invoked for each attribute of an element. It is responsible for
constructing a representation of the attribute and contributing it to the
list of attributes currently being accumulated (by invoking the
attributeAppend(_) method). Attributes can be ignored by not contributing
them. The default definition is:
meth onAttribute(Tag Value)
   {self attributeAppend(
      attribute(
         uri   : Tag.uri
         name  : Tag.name
         value : Value))}
end
Tag is a record describing the attribute's name and has features qname,
prefix, name, uri, coord, with the same interpretation as for elements.
Note that attributes without an explicit namespace prefix are always
considered to be in no namespace, and not in the default namespace (if any).
Note that the attributes of an element are processed before its
onStartElement method is called, because all namespace declarations must be
processed before the tag can be interpreted.
onNamespaceDeclaration(Prefix URI Coord)
- some attributes are really namespace declarations. They are identified by
their xmlns prefix (in any, possibly mixed, capitalization). Prefix and URI
are both atoms, Coord is a debug coordinates record. The default
implementation is similar to that of onAttribute, but additionally checks
@keepNamespaceDeclarations and contributes only if it is true.
onProcessingInstruction(Name Data Coord)
- invoked on processing instructions. Name is an atom, Data is a string,
Coord is a debug coordinates record.
onCharacters(Data Coord)
- invoked for text nodes. Data is a string, Coord is a debug coordinates record.
onComment(Data Coord)
- invoked on comment nodes. Data is a string, Coord is a debug coordinates
record. Note that comment nodes are automatically discarded if
@keepComments is false.
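The options and entry points listed above might be combined as in the following sketch. It assumes that XMLParser (a hypothetical variable name) is already bound to this module and reuses the file name foo/baz.xml from the entity example purely as a placeholder. The structure of Doc depends on the handler methods in use, so it is only displayed here.

```oz
declare Parser SpaceManager M P Doc in
%% Assumption: XMLParser is already bound to the module documented here.
Parser = XMLParser.parser
SpaceManager = XMLParser.spaceManager
%% Discard isolated whitespace text nodes everywhere.
M = {New SpaceManager init}
{M stripSpace('*' '*')}
P = {New Parser init}
%% Keep comment nodes instead of discarding them (the default is false).
{P setKeepComments(true)}
{P setSpaceManager(M)}
%% Placeholder file name; replace with a real document.
{P parseFile('foo/baz.xml' Doc)}
{Browse Doc}
```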
The following example defines MyParser, a subclass of Parser. Each instance
of MyParser is given a SpaceManager that ignores all isolated whitespace
nodes. Namespaces are ignored (this is for a trivial application where we
assume that namespaces are not used). Each element is converted into a
record whose label is the element's local name and whose two features are
alist, a record whose features are the attributes, and children, a list of
the child elements.
class MyParser from Parser
   meth init
      M = {New SpaceManager init}
   in
      {M stripSpace('*' '*')}
      Parser,init
      {self setSpaceManager(M)}
   end
   meth onAttribute(Tag Value)
      {self attributeAppend(Tag.name#Value)}
   end
   meth onStartElement(Tag Alist Children)
      Name = Tag.name
   in
      {self append(
         Name(
            alist    : {List.toRecord alist Alist}
            children : Children))}
   end
end
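A short usage sketch for the class above, assuming MyParser has already been defined in the current scope (for example after feeding the class definition in the OPI) and using a made-up input document. With the handlers defined in MyParser, the doc element should be represented by a record labelled doc with features alist and children. The exact shape of Doc and of the attribute values is not specified here, so the result is only displayed.

```oz
declare P Doc in
%% Assumption: MyParser is the class defined above and is in scope.
P = {New MyParser init}
%% Hypothetical input document, used only for illustration.
{P parseVS("<doc id=\"d1\"><item>hello</item></doc>" Doc)}
{Browse Doc}
```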