Deriving an LFG from a Treebank Resource
Josef Van Genabith, Louisa Sadler & Andy Way
High-quality training corpora are crucial for statistical approaches to natural language processing. Probabilistic Lexical Functional Grammars (LFG-DOP; Bod & Kaplan, 1998) require substantial corpora of texts associated with both c-structure and f-structure representations. This poses an important acquisition problem: manual construction is time-consuming and error-prone, while semi-automatic construction usually itself involves developing a large-scale grammar, parsing with it, and relying on a linguistic expert to tame the resulting explosion of alternative analyses. In this paper we report on a series of experiments that use existing treebank resources to derive (semi-automatically) appropriate training resources for probabilistic LFGs, as an alternative to these time-consuming methods. In the first phase of work, we induce a CF-PSG from the treebank using the method described in Charniak (1996), manually annotate it with functional schemata, and use the resulting LFG to deterministically "reparse" the original treebank representations, simply following the c-structure defined by the original annotations and thereby inducing f-structures corresponding to the original c-structures. Some results from this method are given in Van Genabith et al. (1999). This paper describes two extensions to this work. In the first, we improve the quality of the resource. The annotated grammars of Van Genabith et al. (1999) are not yet stand-alone LFGs, because they do not encode the LFG account of subcategorization in terms of semantic forms (as PRED values) together with completeness and coherence constraints. We develop an automatic method for compiling LFG semantic forms from treebanks annotated with f-structure representations, as a further step towards the automatic induction of a true LFG. In addition, our method provides full semantic forms as PRED values for the f-structures obtained from the treebank in Van Genabith et al. (1999).
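The grammar induction step of the first phase can be illustrated with a minimal sketch of reading context-free rules off treebank trees, in the spirit of a treebank grammar. This is not the authors' implementation: the bracketed input format, the function names `parse_tree`, `collect_rules` and `induce_grammar`, and the toy trees with their labels are all assumptions made for illustration.

```python
from collections import Counter

def parse_tree(text):
    """Parse a bracketed tree, e.g. '(S (NP (DT the) (NN dog)) (VP (VBD barked)))',
    into nested (label, children) tuples; words are kept as plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()

def collect_rules(tree, counts):
    """Count one phrase-structure rule per local tree, skipping lexical entries."""
    label, children = tree
    if all(isinstance(child, str) for child in children):
        return  # preterminal over a word: a lexical entry, not a PS rule
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            collect_rules(child, counts)

def induce_grammar(bracketed_trees):
    """Return a Counter mapping (lhs, rhs) rules to their treebank frequency."""
    counts = Counter()
    for text in bracketed_trees:
        collect_rules(parse_tree(text), counts)
    return counts

# Toy treebank (hypothetical trees, not taken from the AP corpus).
toy_treebank = [
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (DT a) (NN cat)) (VP (VBD slept)))",
]
grammar = induce_grammar(toy_treebank)
for (lhs, rhs), freq in sorted(grammar.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{freq}]")
```

Each distinct rule would then be annotated once with functional schemata; the frequency counts are kept here only because they are cheap to collect and are what a probabilistic grammar would need.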
In the second extension, we use grammar compaction techniques with a view to further decreasing the amount of manual work involved in our method. The very rich preterminal tagset of the AP corpus is very useful for providing f-structural information, but it is also rather redundant in that it leads to an explosion of minimally different PS rules. We therefore develop a two-stage methodology in which we first extract the f-structural information and then shrink the size of the tagset before grammar induction. This significantly reduces the size of the grammar. In the final version of the paper we will evaluate this two-stage approach in detail.
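The two-stage idea can be sketched as follows, under stated assumptions: the tag names, the coarse categories, the feature values in `TAG_TABLE`, and the helper names are all hypothetical and chosen only to show the mechanism, not to reproduce the AP corpus tagset or the authors' procedure. Stage one reads f-structure features off the fine-grained tag; stage two collapses the tag to a coarse category so that minimally different rules merge.

```python
# Hypothetical fine-grained tag -> (coarse tag, f-structure features).
TAG_TABLE = {
    "NN1": ("N", {"NUM": "sg"}),
    "NN2": ("N", {"NUM": "pl"}),
    "VVD": ("V", {"TENSE": "past"}),
    "VVZ": ("V", {"TENSE": "pres", "NUM": "sg"}),
    "AT": ("DET", {}),
}

def features_for(tag):
    """Stage 1: recover the f-structural information carried by a fine tag."""
    return dict(TAG_TABLE[tag][1]) if tag in TAG_TABLE else {}

def coarsen(symbol):
    """Stage 2: map a fine tag to its coarse category; other symbols pass through."""
    return TAG_TABLE[symbol][0] if symbol in TAG_TABLE else symbol

def compact_rules(rules):
    """Rewrite every rule over the coarse tagset; duplicates merge in the set."""
    return {(coarsen(lhs), tuple(coarsen(s) for s in rhs)) for lhs, rhs in rules}

# Toy rule set over the fine tagset (illustrative only).
fine_rules = {
    ("NP", ("AT", "NN1")),
    ("NP", ("AT", "NN2")),
    ("VP", ("VVD", "NP")),
    ("VP", ("VVZ", "NP")),
    ("S", ("NP", "VP")),
}
coarse_rules = compact_rules(fine_rules)
print(len(fine_rules), "rules before compaction,", len(coarse_rules), "after")
```

Because the features are extracted before the tags are collapsed, no f-structural information is lost, while fewer distinct rules remain to be annotated by hand.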