Research Opportunities

I have a number of grants in the pipeline. If they are successful, I will be looking for postdoctoral researchers and postgraduate students in the areas of corpus collection, automatic alignment (of strings, trees etc.) at sentential and sub-sentential levels, probabilistic parsing, corpus-based machine translation, and Lexical-Functional Grammar. If you would like to be considered for any such positions that become available, starting any time up to and including October 2005, do not hesitate to email me.


Andy Way's Current Research Interests

I've (finally!) put my publications on-line.

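This page details my current research interests, which boil down to three areas: robust, hybrid machine translation; example-based machine translation using the Marker Hypothesis; and the automatic compilation of linguistic corpora from treebanks.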

The first of these essentially describes my thesis work together with recent extensions both on parameter estimation and testing out the theoretical ideas on large datasets.

The second describes ongoing work in an area that first caught our interest in 1997. Our newer research efforts seed the EBMT memories with translations obtained from on-line MT systems. In addition, we validate translations proposed by our system using the Web as a large corpus of target strings. We have also recently experimented with controlled-language data.

The third area of interest describes efforts to produce automatically an f-structure bank for each sentence in the Penn-II Treebank using annotated PCFGs. These datasets could ultimately be of use in LFG-DOP-based approaches to language and translation modelling. We hope to make available the full set of Penn-II f-structures in 2004. We have also ported our method to German, and have extracted f-structures from the TIGER treebank. Finally, using the PCFGs extracted from Penn-II, we can now tag, parse and generate f-structures for any sentence of English.

Newer work I am just beginning involves:

The work on the first of these, with postgraduate student Declan Groves, involves combining insights from EBMT and PCFG parsing into statistical models of translation. The second, with postgraduate students Anna Khasin and Bart Mellebeek and with Josef van Genabith, involves tricking existing, wide-coverage, commercially available MT systems into producing better translations. The project uses existing technology and tries to "spoon-feed" MT systems to achieve improved translations.

Robust, Hybrid Machine Translation

We have developed LFG-DOT models for Machine Translation (MT) based on Data-Oriented Parsing (DOP) allied to the syntactic representations of Lexical-Functional Grammar (LFG).

We begin by showing that none of the main paradigmatic approaches to MT currently meets the required standard on its own. Nevertheless, each of these approaches contains elements which, if properly harnessed, should lead to an overall improvement in translation performance. It is in this new hybrid spirit that our search for a better solution to the problems of MT can be seen.

We summarise the original DOP model (Bod 1992, 1995, 1998, 2003), as well as the DOT model of translation which is based on it (Poutsma 1998, 2000). We demonstrate that DOT is not guaranteed to produce the correct translation, despite provably deriving the most probable translation. We go on to critically evaluate previous attempts at LFG-MT, commenting briefly on particular problem cases for such systems. We then show how the LFG-DOP model of Bod & Kaplan (1998) can be extended to serve as a novel hybrid model for MT which promises to improve upon both DOT and the pure LFG-based translation model.

This work, extending ideas in my thesis, is currently being undertaken with a PhD student of mine, Mary Hearne. See Mary's link for recent papers on this topic at MT Summit 2003, and at IJCNLP 2004.

Papers

An introductory paper on this work appeared in a special issue of JETAI on Memory-Based Learning in 1999. More specific information on this special issue can be found here.

Papers on translation using DOP and LFG-DOP have been/will be presented at:

For recent overviews of this work, see also Chapter 16 of my recent book (with Michael Carl) on EBMT, and Chapter 19 of Bod, R., Scha, R. and Sima'an, K. (eds.), Data-Oriented Parsing, CSLI Publications, 2003.

Example-Based Machine Translation using the Marker Hypothesis

Example-Based Machine Translation (EBMT) is an empirical approach to the problems of Machine Translation (MT) which tries to translate new strings on the basis of previously seen examples stored in the system's databases. As a very basic example, we might try to translate the sentence "I went to the baker's" into French with recourse to the following examples stored in the system's memory: That is, we might be able to figure out that the following partial translations may be useful in translating our new input string: If so, we can recombine the target fragments to generate the correct target string (in this instance!).
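The recombination step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the fragment memory is a tiny hand-made table (its entries are hypothetical and are not the examples stored in our system), and the function name `translate` and the greedy longest-match strategy are chosen purely for the sketch.

```python
# Hypothetical sub-sentential memory: source fragments (as word tuples)
# aligned with target fragments. These entries are illustrative assumptions.
FRAGMENT_MEMORY = {
    ("i", "went", "to"): "je suis allé à",
    ("she", "went", "to"): "elle est allée à",
    ("the", "baker's"): "la boulangerie",
    ("the", "library"): "la bibliothèque",
}


def translate(sentence, memory=FRAGMENT_MEMORY):
    """Greedy longest-match lookup followed by left-to-right recombination."""
    words = sentence.lower().split()
    target = []
    i = 0
    while i < len(words):
        # Try the longest source fragment starting at position i first.
        for j in range(len(words), i, -1):
            fragment = tuple(words[i:j])
            if fragment in memory:
                target.append(memory[fragment])
                i = j
                break
        else:
            # No stored fragment covers this word: give up rather than guess.
            return None
    return " ".join(target)


print(translate("I went to the baker's"))
```

Simply concatenating target fragments works here only because the fragments happen to fit together; with a masculine noun the same procedure would yield "à le" where French requires "au", which is one reason recombination is not guaranteed to succeed in general.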

The nature of the examples stored in memory may differ considerably: they may be pure strings, or tagged with POS- or HTML-tags, or they may be aligned PS-trees. Even the LFG-DOT approach above may be considered an instance of EBMT: this time, we are storing not only aligned trees but also their accompanying LFG f-structures.

We have been conducting Marker-Based EBMT at DCU since 1997. The Marker Hypothesis (Green, 1979) is a universal psycholinguistic constraint which states that natural languages are 'marked' for complex syntactic structure at surface form by a closed set of specific lexemes and morphemes. That is, a basic phrase-level segmentation of an input sentence can be achieved by exploiting a closed list of known marker words to signal the start and end of each segment. Consider the following example, selected at random from the Wall Street Journal section of the Penn-II Treebank:

Here we see that three noun phrases start with determiners and one with a possessive pronoun. The sets of determiners and possessive pronouns are both very small. Furthermore, there are four prepositional phrases, and the set of prepositions is similarly small. A further assumption that could be made is that all words which end with '-ed' are verbs, such as 'stopped' in the 'Dearborn' example. The Marker Hypothesis is arguably universal in presuming that concepts and structures like these have similar morphological or structural marking in all languages.
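A rough sketch of marker-based segmentation, in Python, might look as follows. The marker lists are small illustrative assumptions (nowhere near a complete closed-class inventory), the function name `marker_chunks` is hypothetical, and the rule that a chunk must contain at least one non-marker word before the next chunk may begin is a simplification adopted for the sketch rather than a description of our system.

```python
# Small, illustrative closed-class word lists (assumptions, far from complete).
MARKERS = {
    "DET": {"the", "a", "an", "this", "these"},
    "POSS": {"my", "your", "his", "its", "our", "their"},
    "PREP": {"in", "of", "at", "on", "by", "with", "from"},
}
WORD_TO_TAG = {word: tag for tag, words in MARKERS.items() for word in words}


def marker_chunks(sentence):
    """Split a sentence into chunks, opening a new chunk at each marker word.

    A new chunk is only opened once the current one holds at least one
    non-marker word, so marker runs such as "in the" stay together.
    """
    chunks, current, has_content = [], [], False
    for word in sentence.lower().split():
        is_marker = word in WORD_TO_TAG
        if is_marker and has_content:
            chunks.append(current)
            current, has_content = [], False
        current.append(word)
        if not is_marker:
            has_content = True
    if current:
        chunks.append(current)
    return [" ".join(chunk) for chunk in chunks]


for chunk in marker_chunks(
    "the company stopped paying a dividend in the third quarter of its financial year"
):
    print(chunk)
```

The sentence used here is made up for the illustration. Because the marker sets are closed and small, the segmenter needs no parser or tagger, only a lookup table per language.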

The first paper we published was:

I'm currently working on this area of EBMT with a PhD student of mine, Nano Gough. We have published some more recent papers on Marker-Based EBMT, including:

Readers interested in EBMT in general should consult my new book (with Michael Carl)!

Automatic Compilation of Linguistic Corpora from Treebanks

Traditionally, unification grammars are hand-coded. This is extremely time-consuming, expensive and very difficult to scale. Together with Josef van Genabith, and research students Aoife Cahill, Mick Burke and Ruth O'Donovan, we have developed a new method for automatically extracting wide-coverage probabilistic unification (LFG) grammars from treebank resources. To achieve this, we first automatically annotate the treebank (such as Penn-II) with feature-structure information (LFG f-structures, approximating to basic predicate-argument structure). From the f-structure annotated treebank, we then automatically extract wide-coverage, probabilistic LFG approximations to parse new text. We have also applied this methodology to German (TIGER treebank) and are planning to migrate it to other, more diverse languages.
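To give a flavour of the extraction step, the Python sketch below reads rules off a toy annotated treebank and estimates their probabilities by relative frequency. Everything in it is an assumption made for illustration: the two trees, the annotation strings, the function names, and in particular the simplification that each annotated category (for example `NP[↑SUBJ=↓]`) is treated as a single atomic symbol. It is not the annotation algorithm or the grammar described above.

```python
from collections import Counter

# Toy annotated treebank (hypothetical): each tree is (label, child, ...),
# and a leaf is a plain string. Labels carry an f-structure annotation.
TREEBANK = [
    ("S", ("NP[↑SUBJ=↓]", "John"), ("VP[↑=↓]", ("V[↑=↓]", "sleeps"))),
    ("S", ("NP[↑SUBJ=↓]", "Mary"),
          ("VP[↑=↓]", ("V[↑=↓]", "sees"), ("NP[↑OBJ=↓]", "John"))),
]


def rules(tree):
    """Yield (lhs, rhs) for every local tree, treating labels as atomic."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)


def estimate_pcfg(treebank):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    rule_counts = Counter(rule for tree in treebank for rule in rules(tree))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


pcfg = estimate_pcfg(TREEBANK)
for (lhs, rhs), p in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

Keeping the annotation inside the category symbol means that an off-the-shelf PCFG parser could, in principle, return trees from which the f-structure equations can be read off directly.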

For much more information on this work, check out our Treebank page.


January, 2004