In this chapter, we describe the grammar specification language using an illustrative example. We implement an enhanced version of the example grammar presented in (Duchier and Debusmann 2001), which handles several phenomena associated with the German verb cluster. The grammar file is included in this package under the name "grammar-acl.dg". We encourage users to use the extension "dg" for all grammar files.
Here you define what components of the parser the grammar uses:
defuses {id lp}
That is, we use the ID- and the LP-component. The ID-component must be used, whereas the LP-component is optional. To write a grammar that does not include word order constraints (i.e. one that uses only the ID-component), you would write:
defuses {id}
Note: By default, both the ID- and the LP-component are used.
New: from version 1.2, you can also use the TH-component (th).
Here we define the types which will be used in the grammar:
deftypes {
  EDGELABELID : {det subject object vinf vpast zuvinf zu}
  EDGELABELLP : {df mf vc xf zuf}
  NODELABELLP : {d n v z}
  PERSON : {1 2 3}
  GENDER : {masc fem neut}
  NUMBER : {sg pl}
  DEF : {def indef undef}
  CASE : {nom gen dat acc}
  AGR : PERSON * GENDER * NUMBER * DEF * CASE
}
Types are named by variables, i.e. by identifiers beginning with an uppercase letter, e.g. EDGELABELID. Variables are defined as a set of constants which must begin with a lowercase letter. Here, EDGELABELID is defined as a set of grammatical roles. AGR is defined as the Cartesian product of PERSON, GENDER, NUMBER, DEF and CASE. Note that each element of a domain is a typed constant, e.g. det has type EDGELABELID. As a consequence, you must use distinct constants for different domains. Also note that the type EDGELABELID must be defined for each grammar. EDGELABELLP and NODELABELLP must be defined in each grammar that uses the LP-component, and EDGELABELTH in each grammar that uses the TH-component.
Next, we define the features of a lexical entry.
defentry {
  edgeID : EDGELABELID set
  valencyID : EDGELABELID valency
  edgeLP : EDGELABELLP set
  nodeLP : NODELABELLP set
  valencyLP : EDGELABELLP valency
  blocks : EDGELABELID aset
  agrs : AGR set
}
Each feature is typed, and set, aset and valency are built-in type constructors. For example, EDGELABELID set denotes a set of grammatical roles. The difference between set and aset affects maximal values and inheritance, which we explain in 2.8.
In each grammar, the features edgeID and valencyID are obligatory. Each grammar that uses the LP-component must include edgeLP, nodeLP, valencyLP and blocks. Each grammar using the TH-component must include edgeTH, valencyTH, link, raisedsubj and blocksTH.
The features edgeID, edgeLP, nodeLP, blocks, edgeTH, raisedsubj and blocksTH must be either of type set or aset. valencyID, valencyLP and valencyTH must be of type valency and link of type EDGELABELTH -> EDGELABELID aset.
The union of the sets EDGELABELLP and NODELABELLP must be totally ordered. We specify the order as follows:
{ z zuf d df n mf vc v xf }
For each word, there is a corresponding sign with the following internal structure:
sign(
  lex : o(index: Index
          word : Word
          entry: Entry)
  id : NodeID
  lp : NodeLP
  th : NodeTH
  attribute : AttributeRecord)
where Index is the index of the selected entry, Word the string corresponding to the node and Entry the selected lexical entry itself. NodeID holds the information for the occurrence of the node in the ID tree. If the LP-component is used, NodeLP bears the corresponding information for its occurrence in the LP tree and (if the TH-component is used) NodeTH in the TH graph. AttributeRecord is a record that holds additional attributes which are introduced as follows:
defattributes { agr : AGR }
The defining constraints for these attributes are then specified as follows:
defnode { _[agr] in _.lex.entry.agrs }
In other words, each one of these attributes is introduced in order to pick one of the values licensed by the lexical entry.
We stipulate edge constraints next. First, in the ID tree:
defedges id {
  det { _[agr] = ^[agr] }
  subject { _[agr] = ^[agr]
            _[agr] in $ nom }
  object { _[agr] in $ acc }
}
For example, this states that for an edge labeled det to be licensed, the daughter must agree with its mother (i.e. _[agr]=^[agr]). _ denotes the 'current' node, and ^ its head. The notation _[agr] is equivalent to _.attribute.agr and is merely supported for convenience.
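To make the meaning of these edge constraints concrete, here is a minimal Python sketch. It is not part of the parser: the names Agr, EDGE_CONSTRAINTS and edge_licensed are hypothetical, agreement values are modelled as plain tuples, and treating labels without a constraint as always licensed is an assumption made for the illustration.

```python
from collections import namedtuple

# Hypothetical model of one agreement tuple (PERSON * GENDER * NUMBER * DEF * CASE).
Agr = namedtuple("Agr", "person gender number definiteness case")

# One predicate per ID edge label. `d` plays the role of _[agr] (the daughter),
# `m` the role of ^[agr] (the mother).
EDGE_CONSTRAINTS = {
    "det": lambda d, m: d == m,
    "subject": lambda d, m: d == m and d.case == "nom",
    "object": lambda d, m: d.case == "acc",
}

def edge_licensed(label, daughter_agr, mother_agr):
    """Return True if an ID edge with this label satisfies its constraint.

    Labels without an entry are treated as unconstrained (an assumption).
    """
    check = EDGE_CONSTRAINTS.get(label)
    return True if check is None else check(daughter_agr, mother_agr)

verb = Agr("3", "masc", "sg", "undef", "nom")
noun_nom = Agr("3", "masc", "sg", "undef", "nom")
noun_acc = Agr("3", "fem", "sg", "def", "acc")
```

With these values, a subject edge from verb to noun_nom is licensed, whereas a subject edge to noun_acc is not, because the daughter neither agrees with the mother nor is nominative.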
We can define similar constraints for edges in the LP tree. In the example grammar, we do not define any constraints for edges in the LP tree:
defedges lp { __ { } }
Here, __ matches any edge label.
For the parser, we must specify a distribution strategy. Currently, we can specify the sequence of features on which to perform labeling:
defdistribute {
  _.id.mothers
  _.id.daughterSet
  _.lp.mothers
  _.lp.daughterSet
  _.lp.nodeLP
  _.lp.pos
}
This says to perform labeling first on the ID mothers, then on the ID daughter sets, the LP mothers, the LP daughter sets, the node labels and the positions.
Finally, we need to specify a lexicon. The lexicon can be specified on the basis of lexical types which can be combined using lexical inheritance to obtain lexical entries. In "grammar-acl.dg", finite verbs inherit from the following lexical type:
defword t_fin {
  edgeID : {}
  valencyID : {subject}
  edgeLP : {}
  nodeLP : {v}
  valencyLP : {mf* xf?}
  blocks : {det subject object vinf vpast zuvinf}
}
This lexical type indicates that the set of accepted roles edgeID of a finite verb is the empty set and that finite verbs always subcategorize for a subject through their role valency valencyID. The set of accepted fields is empty and the set of accepted node labels includes only v. Through its field valency (lexical attribute valencyLP), a finite verb offers a Mittelfeld (mf) and an extraposition field (xf). It blocks the set of all roles.
Valency (i.e. the attributes valencyID and valencyLP) is specified using wildcard notation: e.g. subject indicates that exactly one syntactic dependent with edge label subject is required. xf? indicates that at most one topological dependent with edge label xf is permitted and mf* that any number of dependents with edge label mf is permitted. One or more dependents are indicated by a +.
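The wildcard notation can be read as a cardinality check on the outgoing edges of a node. The following Python sketch illustrates that reading; the function names are hypothetical, the "!" marker is our own internal stand-in for a bare label, and rejecting labels that the valency does not mention at all is an assumption of this sketch.

```python
def parse_valency_item(item):
    """Split 'mf*' into ('mf', '*'); a bare label such as 'subject' means exactly one."""
    if item[-1] in "?*+":
        return item[:-1], item[-1]
    return item, "!"  # '!' is an internal marker for "exactly one"

def count_allowed(marker, n):
    """Does a count of n dependents satisfy the given cardinality marker?"""
    return {"!": n == 1, "?": n <= 1, "*": True, "+": n >= 1}[marker]

def satisfies(valency, dependent_labels):
    """Check a list of outgoing edge labels against a valency set."""
    spec = dict(parse_valency_item(item) for item in valency)
    # Assumption: a label that the valency does not mention is not licensed.
    if any(label not in spec for label in dependent_labels):
        return False
    return all(count_allowed(marker, dependent_labels.count(label))
               for label, marker in spec.items())
```

For instance, satisfies({"mf*", "xf?"}, ["mf", "mf"]) holds, while satisfies({"mf*", "xf?"}, ["xf", "xf"]) does not, because xf? permits at most one xf dependent.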
As in the example, we can omit lexical attributes. An omitted attribute is assigned its maximal value, which depends on the attribute's type. If the attribute is of type set, the maximal value is its range. Hence in the specification of the lexical type t_fin above, the omitted attribute agrs is assigned the set AGR of all agreement tuples. If the omitted lexical attribute is of type aset or of type valency, its maximal value is the empty set.
Transitive verbs inherit from the following lexical type:
defword t_tr { valencyID : {object} }
t_tr only specifies the valencyID-attribute, stating that an object is required. All the other lexical attributes are assigned their maximal values.
Here is how we obtain the lexical entry for the word liebt, using lexical inheritance:
defword liebt t_fin t_tr { agrs : $ 3 & sg & nom }
The lexical entry for liebt defines only the value of the lexical attribute agrs and the other lexical attributes are assigned their maximal values. In the specification of the agrs-attribute, the prefix operator $ introduces a set generator which is a boolean expression that generates values for the corresponding type. For example, $ 3 & sg & nom denotes the set of agreement tuples that are 3rd person, singular and nominative. In addition, the lexical entry inherits from the lexical types t_fin and t_tr, stating that it is both a finite verb and a transitive verb.
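The effect of a set generator can be illustrated with a short Python sketch. It models only conjunction (&) and relies on the requirement that constants are distinct across domains; the names DOMAINS, AGR and generate are hypothetical and do not reflect how the parser implements generators.

```python
from itertools import product

# Domains copied from the deftypes section, in the order used by AGR.
DOMAINS = {
    "PERSON": ["1", "2", "3"],
    "GENDER": ["masc", "fem", "neut"],
    "NUMBER": ["sg", "pl"],
    "DEF": ["def", "indef", "undef"],
    "CASE": ["nom", "gen", "dat", "acc"],
}

# AGR is the Cartesian product of the five domains.
AGR = set(product(*DOMAINS.values()))

def generate(*constants):
    """Model of `$ c1 & c2 & ...`: all tuples that contain every given constant."""
    return {t for t in AGR if all(c in t for c in constants)}

liebt_agrs = generate("3", "sg", "nom")
print(len(AGR), len(liebt_agrs))  # prints: 216 9
```

The nine resulting tuples differ only in gender and definiteness, the two components that $ 3 & sg & nom leaves unconstrained.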
Notice that there can of course be several entries for one word form. Also note that lexical entries have to be escaped using quotation characters if there is an identical type in the deftypes-section. In "grammar-acl.dg", we do not have to escape liebt because there is no identical type defined in the deftypes-section. However, if there were a lexical entry for the word subject, we would have to escape it and write 'subject' instead. Further notice that in this implementation, we do not distinguish between lexical types and lexical entries: both are defined in exactly the same way. It is however convenient to distinguish lexical types from lexical entries notationally, and for this reason we adopt the convention of prefixing lexical types with t_.
Lexical inheritance proceeds differently for each lexical attribute depending on its type. It amounts to set intersection if the lexical attribute is of type set and to set union if it is of type aset or valency. For instance, this is how the lexical entry for liebt is obtained:
Attribute | t_fin | t_tr | liebt | liebt t_fin t_tr |
edgeID : EDGELABELID set | {} | EDGELABELID (max) | EDGELABELID (max) | {} |
valencyID : EDGELABELID valency | {subject} | {object} | {} (max) | {subject object} |
edgeLP : EDGELABELLP set | {} | EDGELABELLP (max) | EDGELABELLP (max) | {} |
nodeLP : NODELABELLP set | {v} | {d n v z} (max) | {d n v z} (max) | {v} |
valencyLP : EDGELABELLP valency | {mf* xf?} | {} (max) | {} (max) | {mf* xf?} |
blocks : EDGELABELID aset | EDGELABELID | {} (max) | {} (max) | EDGELABELID |
agrs : AGR set | AGR (max) | AGR (max) | $ 3 & sg & nom | $ 3 & sg & nom |
In the table above, we display the lexical attributes in the leftmost column. The second column from the left indicates the values of these lexical attributes specified by the lexical type t_fin, the third those specified by t_tr and the fourth those specified by the lexical entry liebt. We display the resulting values in the rightmost column. Values annotated with (max) are omitted in the respective lexical specification and are therefore assigned their maximal values. As can be seen from the example, inheritance amounts to set intersection for the lexical attributes with type set, i.e. edgeID, edgeLP, nodeLP and agrs. Lexical inheritance amounts to set union for the lexical attributes with type aset and valency, i.e. valencyID, valencyLP and blocks.
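The interplay of maximal values and inheritance in the table above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the parser's implementation: the names ATTRS, max_value and inherit are hypothetical, only three attributes are modelled, valencies are left out because they combine in a special way, and the blocks value of t_fin follows the table (the full set EDGELABELID).

```python
EDGELABELID = {"det", "subject", "object", "vinf", "vpast", "zuvinf", "zu"}
NODELABELLP = {"d", "n", "v", "z"}

# Type constructor and range of a few lexical attributes.
ATTRS = {
    "edgeID": ("set", EDGELABELID),
    "nodeLP": ("set", NODELABELLP),
    "blocks": ("aset", EDGELABELID),
}

def max_value(kind, full_range):
    """Maximal value: the whole range for `set`, the empty set for `aset`."""
    return set(full_range) if kind == "set" else set()

def inherit(*specs):
    """Combine lexical specifications; omitted attributes get their maximal value."""
    result = {}
    for name, (kind, full_range) in ATTRS.items():
        values = [set(spec.get(name, max_value(kind, full_range))) for spec in specs]
        if kind == "set":
            result[name] = set.intersection(*values)  # `set`: intersection
        else:
            result[name] = set.union(*values)         # `aset`: union
    return result

t_fin = {"edgeID": set(), "nodeLP": {"v"}, "blocks": EDGELABELID}
t_tr = {}   # specifies only valencyID, which is not modelled here
liebt = {}  # specifies only agrs, which is not modelled here
entry = inherit(t_fin, t_tr, liebt)
```

Because the maximal value is the neutral element of the respective operation (the full range for intersection, the empty set for union), an omitted attribute never changes the inherited result.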
Notice that inheritance proceeds slightly differently for valencies than for normal accumulative set lattices. If two elements with the same edge label but with different cardinality are to be combined, only the more specific of the two is contained in the resulting set. The order of specificity is as follows for an edge label r:
(r not in the valency set) < r* < r? < r+ < r
If, for instance, we combine the valency set {subj? adv*} with the valency set {subj adv?}, the result is not {subj? subj adv* adv?} but {subj adv?} because subj is more specific than subj? and adv? is more specific than adv*.
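The combination of valencies described above can be sketched as follows. The Python names RANK, split and combine are hypothetical; the ranking simply encodes the order of specificity given above, with the empty string standing for a bare label.

```python
# Specificity rank of each cardinality marker; "" stands for a bare label
# (exactly one dependent), which is the most specific.
RANK = {"*": 1, "?": 2, "+": 3, "": 4}

def split(item):
    """Split 'adv*' into ('adv', '*') and 'subj' into ('subj', '')."""
    return (item[:-1], item[-1]) if item[-1] in "*?+" else (item, "")

def combine(*valencies):
    """Union of valency sets, keeping only the most specific item per edge label."""
    best = {}
    for valency in valencies:
        for item in valency:
            label, marker = split(item)
            if label not in best or RANK[marker] > RANK[best[label]]:
                best[label] = marker
    return {label + marker for label, marker in best.items()}

print(sorted(combine({"subj?", "adv*"}, {"subj", "adv?"})))
# prints: ['adv?', 'subj']
```

A label that is absent from one of the valency sets is simply taken from the other, which matches the lowest position of "r not in the valency set" in the order.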