I have a number of grants in the pipeline. If they are successful, I will be looking for postdoctoral researchers and postgraduate students in the areas of corpus collection, automatic alignment (of strings, trees etc.) at sentential and sub-sentential levels, probabilistic parsing, corpus-based machine translation, and Lexical-Functional Grammar. If you would like to be considered for any such positions that become vacant, starting any time up to and including October 2005, do not hesitate to email me.
Andy Way's Current Research Interests
This page details my current research interests, which boil down to:
The first of these essentially describes my thesis work, together with recent extensions both on parameter estimation and on testing the theoretical ideas on large datasets.
The second describes ongoing work which first caught our interest in 1997. Our newer research efforts seed the EBMT memories with translations obtained from on-line MT systems. In addition, we validate translations proposed by our system using the Web as a large corpus of target strings. We have also recently experimented with controlled language data.
The third area of interest describes efforts to automatically produce an f-structure bank covering each sentence in the Penn-II Treebank using annotated PCFGs. These datasets could ultimately be of use in LFG-DOP-based approaches to language and translation modelling. We hope to make available the full set of Penn-II f-structures in 2004. We have also ported our method to German, and have extracted f-structures from the TIGER treebank. Finally, using the PCFGs extracted from Penn-II, we can now tag, parse and generate f-structures for any sentence of English.
Newer work I am just beginning involves:
The work on the first of these, with postgraduate student Declan Groves, involves combining insights from EBMT and PCFG parsing into statistical models of translation. The second, with postgraduate students Anna Khasin and Bart Mellebeek, and Josef van Genabith, involves tricking existing, wide-coverage, commercially available MT systems into producing better translations. The project uses existing technology and tries to "spoon-feed" MT systems to achieve improved translations.
I've (finally!) put my publications on-line.
Robust, Hybrid Machine Translation
We have developed LFG-DOT models for Machine Translation (MT) based on Data-Oriented Parsing (DOP) allied to the syntactic representations of Lexical Functional Grammar (LFG). We begin by showing that, in themselves, none of the main paradigmatic approaches to MT currently meets the standard required. Nevertheless, each of these approaches contains elements which, if properly harnessed, should lead to an overall improvement in translation performance. It is in this new hybrid spirit that our search for a better solution to the problems of MT can be seen.
We summarise the original DOP model (Bod 1992, 1995, 1998, 2003), as well as the DOT model of translation which is based on it (Poutsma 1998, 2000). We demonstrate that DOT is not guaranteed to produce the correct translation, despite provably deriving the most probable translation. We go on to critically evaluate previous attempts at LFG-MT, commenting briefly on particular problem cases for such systems. We then show how the LFG-DOP model of Bod & Kaplan (1998) can be extended to serve as a novel hybrid model for MT which promises to improve upon DOT, as well as the pure LFG-based translation model.
This work, extending ideas in my thesis, is currently being undertaken with a PhD student of mine, Mary Hearne. See Mary's link for recent papers on this topic at MT Summit 2003, and at IJCNLP 2004.
An introductory paper on this work appeared in a special issue of JETAI on Memory-Based Learning in 1999. More specific information on this special issue can be found here.
Papers on translation using DOP and LFG-DOP have been/will be presented at:
For recent overviews of this work, see also Chapter 16 of my recent book (with Michael Carl) on EBMT, and Chapter 19 of Bod, R., Scha, R. and Sima'an, K. (eds.), Data-Oriented Parsing, CSLI Publications, 2003.
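For readers unfamiliar with DOP, the following is a minimal Python sketch of how probabilities are usually described for the basic DOP model: a fragment's probability is its relative frequency among fragments sharing the same root label, a derivation's probability is the product of its fragment probabilities, and a tree's probability is the sum over its derivations. It is an illustration only, not our LFG-DOT implementation; the fragment identifiers, counts and function names are all hypothetical.

```python
from collections import defaultdict
from math import prod

# Hypothetical fragment counts, keyed by (root label, fragment id).
# In a real system these would be subtrees extracted from a treebank.
FRAGMENT_COUNTS = {
    ("S", "s1"): 2,
    ("S", "s2"): 2,
    ("NP", "np1"): 3,
    ("NP", "np2"): 1,
}


def fragment_probabilities(counts):
    """Relative frequency of each fragment among fragments with the same root."""
    totals = defaultdict(int)
    for (root, _), n in counts.items():
        totals[root] += n
    return {key: n / totals[key[0]] for key, n in counts.items()}


def derivation_probability(derivation, probs):
    """Product of the probabilities of the fragments used in one derivation."""
    return prod(probs[fragment] for fragment in derivation)


def tree_probability(derivations, probs):
    """Sum over all derivations that yield the same tree."""
    return sum(derivation_probability(d, probs) for d in derivations)


probs = fragment_probabilities(FRAGMENT_COUNTS)
# Two (hypothetical) derivations of the same tree:
# one combines s1 with np1, the other uses the larger fragment s2 alone.
derivations = [[("S", "s1"), ("NP", "np1")], [("S", "s2")]]
print(tree_probability(derivations, probs))
```

Because a tree's probability is spread over many derivations, the most probable derivation and the most probable tree (or translation) need not coincide, which is one reason the choice of ranking criterion matters in DOP-style models.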
Example-Based Machine Translation using the Marker Hypothesis
Example-Based Machine Translation (EBMT) is an empirical approach to the problems of Machine Translation (MT) which tries to translate new strings on the basis of previously seen examples stored in the system's databases. As a very basic example, we might try to translate the sentence "I went to the baker's" into French with recourse to the following examples stored in the system's memory:
The nature of the examples stored in memory may differ considerably: they may be pure
strings, or tagged with POS- or HTML-tags, or they may be aligned PS-trees. Even the
LFG-DOT approach above may be considered an instance of EBMT: this time, we are storing
not only aligned
We have been conducting research on Marker-Based EBMT at DCU since 1997. The Marker Hypothesis
(Green, 1979) is a universal psycholinguistic constraint which
states that natural languages are 'marked' for complex syntactic
structure at surface form by a closed set of specific lexemes and
morphemes. That is, a basic phrase-level segmentation of an input
sentence can be achieved by exploiting a closed list of known marker
words to signal the start and end of each segment. Consider
the following example, selected at random from the Wall Street
Journal section of the Penn-II Treebank:
The first paper we published was:
Automatic Compilation of Linguistic Corpora from Treebanks
Here we see that three noun phrases start with determiners and one
with a possessive pronoun. The sets of determiners and possessive pronouns are
both very small. Furthermore, there are four prepositional phrases, and
the set of prepositions is similarly small. A further assumption that
could be made is that all words which end with '-ed' are verbs, such
as 'stopped' in 'Dearborn'. The Marker Hypothesis is
arguably universal in presuming that concepts and structures like these
have similar morphological or structural marking in all languages.
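As a rough illustration of marker-based segmentation, here is a minimal Python sketch. It is not our actual system: the marker lists are tiny and hypothetical, the test sentence is invented, and the rule that a segment needs at least one non-marker word before it can be closed is an assumption of this sketch.

```python
# Hypothetical, deliberately tiny marker lexicon; a real system would use
# fuller closed-class word lists for each language.
MARKERS = {
    "DET": {"the", "a", "an", "this", "these", "those"},
    "POSS": {"my", "your", "his", "her", "its", "our", "their"},
    "PREP": {"in", "of", "at", "on", "from", "with", "by", "for"},
}


def marker_category(word):
    """Return the marker category of a word, or None if it is not a marker."""
    lowered = word.lower()
    for category, words in MARKERS.items():
        if lowered in words:
            return category
    return None


def segment(sentence):
    """Split a sentence into segments, opening a new one at each marker word.

    A segment is only closed once it contains a non-marker word, so runs of
    adjacent markers (e.g. 'in the') stay together in one segment.
    """
    segments = []
    current = []
    has_content = False
    for token in sentence.split():
        is_marker = marker_category(token) is not None
        if is_marker and has_content:
            segments.append(" ".join(current))
            current = []
            has_content = False
        current.append(token)
        if not is_marker:
            has_content = True
    if current:
        segments.append(" ".join(current))
    return segments


print(segment("the company stopped paying a dividend in the third quarter of its financial year"))
```

On this invented sentence the sketch yields four segments, each introduced by a determiner, a preposition, or a preposition followed by a determiner or possessive pronoun.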
I'm currently working on this area of EBMT with a PhD student of mine, Nano Gough. We have published some more recent papers on Marker-Based EBMT, including:
Readers interested in EBMT in general should consult my new book (with Michael Carl)!
For much more information on this work, check out our Treebank page.
January, 2004