namespace aware XML parser
The Parser module implements a namespace-aware XML parser that additionally understands the optional DOCTYPE declaration just enough to respect ENTITY declarations. For example, if you place the following entity declarations in your document's DOCTYPE:
<!ENTITY section1 SYSTEM "foo/baz.xml">
<!ENTITY w3 "http://www.w3.org">
then any occurrence of the entity reference &section1; causes the contents of the file foo/baz.xml to be included, and any occurrence of the entity reference &w3; is expanded into http://www.w3.org. The parser is also able to strip whitespace text nodes on the fly according to a user specification.
The Parser module exports the following procedures:
{Parser.parse +Spec ?Tree}
Takes a specification Spec and returns a parsed document in the form of a Tree. Spec is a record with the following optional features:
string:S
url:URL
file:FILE
The input is read from the string S, or from a URL or a FILE.
context:CTX
CTX is a context created by Parser.newContext (see below).
strip:STRIP
STRIP is a table indicating for which elements the parser should strip isolated whitespace text nodes.
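As a minimal sketch of the string feature, the following parses a document held in memory rather than in a file. It assumes the same init(...) record label and Link URL used in the examples later in this document; the sample markup is made up for illustration.

```oz
declare
[P]={Link ['x-ozlib://duchier/xml/Parser.ozf']}
% Hypothetical input: a small document given directly as a string
S="<doc><title>Hello World</title></doc>"
% Parse from the string and drop parent links before inspecting
{Inspect {P.noParent {P.parse init(string:S)}}}
```

Replacing string:S with url:URL or file:FILE should select the other input sources in the same way.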
{Parser.newContext ?CTX}
Returns a new context CTX, which supports the following operations:
{CTX.putPrefix +PREFIX +URI}
Adds to CTX the declaration that associates namespace prefix PREFIX with namespace URI URI; both values should be virtual strings.
{CTX.intern +USR ?SYS}
USR is a string representing a name, possibly with a namespace prefix. The return value SYS is the unique internal representation of the corresponding expanded name; this is a record
qname(uri:URI name:LOC xname:XLOC)
where URI is the URI of the namespace, LOC is the local part of the name, and XLOC is the full expanded name. XLOC is just LOC when URI is empty, and is otherwise formed by concatenating LOC, ' @ ', and URI. All three values are atoms. XLOC may be used as a key that uniquely identifies the name.
{CTX.clone ?CTX2}
Returns a copy CTX2 of CTX to which new namespace prefix declarations may be independently added.
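The context operations above can be sketched together as follows. This is an illustration only: the prefixes bar and the URI other/name/space are made-up values, and the Link URL is the one used in the examples later in this document.

```oz
declare
[P]={Link ['x-ozlib://duchier/xml/Parser.ozf']}
CTX={P.newContext}
% Associate prefix foo with a namespace URI
{CTX.putPrefix "foo" "my/name/space"}
% Intern a prefixed name; Q is a qname(uri:_ name:_ xname:_) record
Q={CTX.intern "foo:doc"}
{Inspect Q.uri}
{Inspect Q.name}
{Inspect Q.xname}
% Clone the context and extend only the copy (hypothetical prefix and URI)
CTX2={CTX.clone}
{CTX2.putPrefix "bar" "other/name/space"}
{Inspect {CTX2.intern "bar:item"}.xname}
```

Because CTX2 is an independent copy, the bar declaration should not be visible through CTX.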
{Parser.noParent +TREE1 ?TREE2}
Each node of a parsed tree has a parent feature pointing to its parent. This makes it difficult to display the trees in the Inspector. The noParent procedure recursively removes the parent feature; you should typically invoke it on a tree and inspect the result.
Consider a file example.xml with the following contents:
<doc xmlns="my/name/space">
  <title>Hello World</title>
  <p>
    <em>text</em>
  </p>
</doc>
Now let's parse it, with no frills:
declare
[P]={Link ['x-ozlib://duchier/xml/Parser.ozf']}
{Inspect {P.noParent {P.parse init(file:'example.xml')}}}
We see that there is white space before the title element, between the title and the p, after the p, and inside the p on either side of the em. We are now going to tell the parser that it should strip the white space nodes between the children of doc.
We create a context CTX, add to it the declaration that prefix foo corresponds to namespace URI my/name/space, obtain the internal representation of name doc in this namespace, and add an entry for it in the STRIP table:
declare
CTX={P.newContext}
{CTX.putPrefix 'foo' 'my/name/space'}
STRIP={NewDictionary}
STRIP.({CTX.intern "foo:doc"}.xname) := true
{Inspect {P.noParent {P.parse init(file:'example.xml' context:CTX strip:STRIP)}}}
As expected, the white space nodes between the children of doc were removed, but those surrounding em in p were preserved. We can additionally strip those too as follows:
STRIP.({CTX.intern "foo:p"}.xname) := true
{Inspect {P.noParent {P.parse init(file:'example.xml' context:CTX strip:STRIP)}}}
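When several elements need stripping, the per-element entries can be built in a loop. The sketch below wraps the steps shown above in a helper; MakeStrip is a hypothetical name introduced here for illustration and is not part of the Parser module.

```oz
declare
[P]={Link ['x-ozlib://duchier/xml/Parser.ozf']}
% MakeStrip is a hypothetical helper, not exported by Parser.
% It returns a dictionary keyed by the xname of each interned name.
fun {MakeStrip CTX Names}
   D={NewDictionary}
in
   for N in Names do
      D.({CTX.intern N}.xname) := true
   end
   D
end
CTX={P.newContext}
{CTX.putPrefix 'foo' 'my/name/space'}
% Strip whitespace nodes between the children of both doc and p
STRIP={MakeStrip CTX ["foo:doc" "foo:p"]}
{Inspect {P.noParent {P.parse init(file:'example.xml' context:CTX strip:STRIP)}}}
```

This should be equivalent to adding the foo:doc and foo:p entries one at a time, as done above.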