Deriving Linguistic Resources from Treebanks

This page is under construction.

Welcome to the Dublin-Essex-Saarbrücken Treebank project! We have been counting visitors to this page since September 27th 1999.

We have two papers accepted for ACL-04 in Barcelona, in July.

We invite you to revisit this page, since we are adding new material all the time. Recent additions (early August 2001) include copies of all our papers written so far. We are also very pleased to announce that Josef Van Genabith and Andy Way were recently awarded £100,000 from Enterprise Ireland under their Basic Research scheme to carry out further research in this area over the next 3 years. Two postgraduate students, Aoife Cahill and Mairead McCarthy, started their programmes of study in October 2001.

Introduction

This page describes work done by Josef Van Genabith, Louisa Sadler, Anette Frank and Andy Way on manipulating Treebanks to develop other linguistic resources.

Probabilistic Unification Grammars (e.g. LFG-DOP: Bod and Kaplan, 1998) require large, high-quality training corpora. These corpora have to provide tree structures with feature structure annotations. Such corpora are expensive to construct and hard to come by. The traditional procedure for constructing them is to take a large-scale unification grammar (in the real world, this often means writing one yourself!) and parse text with it. Typically, for each string in the input text the grammar will produce hundreds or thousands of candidate tree-feature structure pairs, from which a highly trained linguist has to pick the best analysis for inclusion in the training corpus. This is time-consuming and error-prone.

We have developed an alternative method. The basic idea is extremely simple. As input, our method requires a treebank. From this we automatically compile the CF-PSG following the method of [Charniak, 96]. We then manually annotate the CF-PSG rules with f-structure equations and provide macros for the lexical categories. Then (and this is the trick) we "reparse" the treebank entries (not the strings), simply following the tree annotations put there by the original human annotators, and as we do so we solve the f-equations on the rules encountered in that process. This results in an f-structure induced by the best-fitting tree for the example at hand. If the f-structure annotations are deterministic, then the whole process is deterministic too, and we do not have to choose from hundreds or thousands of alternatives.
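The following is a minimal sketch of the idea in Python, not the project's actual implementation. It assumes a toy tree encoding (nested tuples), a hypothetical annotation table in which each daughter on a rule's right-hand side is marked either as a head ("up = down", written None) or with a grammatical function ("(up GF) = down"), and hypothetical lexical macros keyed by preterminal category. All names (TREE, ANNOTATIONS, MACROS, extract_rules, f_structure) are illustrative only.

```python
from collections import Counter

# A tree is (category, children); a preterminal's only child is the word (a str).
TREE = ("S", [
    ("NP", [("NNP", ["Mary"])]),
    ("VP", [("VBZ", ["sees"]), ("NP", [("NNP", ["John"])])]),
])

def is_preterminal(tree):
    return all(isinstance(child, str) for child in tree[1])

def extract_rules(tree, counts=None):
    """Read the context-free rules off a tree, counting how often each occurs."""
    counts = Counter() if counts is None else counts
    if is_preterminal(tree):
        return counts  # lexical categories are handled by macros, not rules
    cat, children = tree
    counts[(cat, tuple(child[0] for child in children))] += 1
    for child in children:
        extract_rules(child, counts)
    return counts

# Hypothetical manual annotations, one entry per right-hand-side daughter.
# None means "up = down"; a string GF means "(up GF) = down".
ANNOTATIONS = {
    ("S", ("NP", "VP")): ("SUBJ", None),
    ("VP", ("VBZ", "NP")): (None, "OBJ"),
    ("NP", ("NNP",)): (None,),
}

# Hypothetical lexical macros for the preterminal categories.
MACROS = {
    "NNP": lambda word: {"PRED": word, "NUM": "sg"},
    "VBZ": lambda word: {"PRED": word, "TENSE": "pres"},
}

def unify(f, g):
    """Merge feature structure g into f in place; raise on conflicting atoms."""
    for attr, val in g.items():
        if attr not in f:
            f[attr] = val
        elif isinstance(f[attr], dict) and isinstance(val, dict):
            unify(f[attr], val)
        elif f[attr] != val:
            raise ValueError(f"clash on {attr}: {f[attr]!r} vs {val!r}")
    return f

def f_structure(tree):
    """Follow the given tree (no search) and solve the annotations on its rules."""
    cat, children = tree
    if is_preterminal(tree):
        return MACROS[cat](children[0])
    rule = (cat, tuple(child[0] for child in children))
    fs = {}
    for child, gf in zip(children, ANNOTATIONS[rule]):
        sub = f_structure(child)
        unify(fs, sub if gf is None else {gf: sub})
    return fs

if __name__ == "__main__":
    print(extract_rules(TREE))
    print(f_structure(TREE))
```

Because f_structure only follows the tree already chosen by the treebank annotators, it never enumerates alternative parses: with one annotation per rule, each tree yields exactly one f-structure or a unification failure.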

Papers

Papers have been presented at LFG-99, at the EACL Workshop on Linguistically Interpreted Corpora, at the ATALA Workshop on Treebanks, at LFG-2000, at LFG-2001, at LREC-2002, at LFG-2002, and at Treebanks and Linguistic Theories.

Downloadable Results

The publicly available subset of the AP Treebank consists of 100 sentences of newswire reports.

These are available here as:

We are beginning to make available the f-structures we have derived semi-automatically from the AP Treebank, developed at Lancaster.

Again, there are several sets of f-structures:

These will be made available in various other formats, including LaTeX, for optimal reusability.

Other potentially useful resources we have developed include:

Once we are satisfied that we have finished our work, we intend to make the grammars themselves available.

Other, newer results include work on automatically annotating the Penn II Treebank with LFG functional annotations. We reported on this work at the LREC-2002 workshop on Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Language Data, and at LFG-2002. We are making the current, draft resources available:

Finally, you may wish to access some other treebanks. Some of the better known ones include:

Some of you may be interested in a new treebank project site set up in Paris, as well as a new book on Treebanks.

Andy Way. Last edited: 31st July, 2002