ACL3 Project

Probabilistic Context-Free Parsing

 

 

 

 

 

 

 

Aoife Cahill: 97093246

 

Supervisors:   Josef van Genabith

Fintan Costello


Probabilistic Context-Free Parsing

1.      Introduction

In the space of the last ten years, statistical methods have gone from being virtually unknown in computational linguistics to being a fundamental given. [1] The use of statistical methods has greatly furthered progress in many areas of language processing. Disambiguation, database classification of documents, speech recognition and grammar learning are just some of the areas that statistics has helped. We now have several large electronically available corpora, from which we can extract the statistical information that can help us understand language. These corpora are usually annotated and we can use these annotations to our advantage. Due to the sheer volume of text and data in these corpora, we can assume certain regularities of a language and try to describe the features of a language by observing frequent characteristics. Statistics is the ideal tool for analysing these frequently occurring phenomena.

            One reason statistics holds such high esteem among computational linguists is because the language being analysed has actually been written or spoken by natives of the language, and does not rely on fabricated examples produced by linguists. It is "real data" as spoken and written by natives. This would suggest that by using statistics that we get a more accurate view of a language. Up until now there have been problems with either over-generation or under-generation. A more robust method of language processing was needed. Statistics offers this robustness, since it allows for a best guess in case of uncertainty. [3] Of course, as with most theories, there are disadvantages to using statistics. Sparse data is probably the main problem that computational linguists face when trying to use statistics to describe language. The results we achieve from analysing corpora depend heavily on the type of data found in the training corpus and the following test corpus. There is no perfect data collection, nor any perfect test suite. [4] Therefore when we use statistics, we will always get only an approximation of the language.

I intend to show how statistics are used in studying language, in particular probabilistic context free grammars, and how these grammars can be used to produce statistical parsers. I would also like to show the advantages and disadvantages of using statistics in parsing, and to explore some of the methods to optimise parsing methods by using statistics. I will extract a probabilistic context-free grammar from a parsed version of the Penn Treebank using Perl. I will then discuss possible methods of using this grammar in a parser.

 

 


2.      Hidden Markov Models

Hidden Markov Models were one of the first implementations of statistical methods for linguistics. The most common use of HMMs today is in speech processing. The two most common forms of calculating probabilities of sentences are the bigram and trigram models. A Bigram model is one in which it is assumed that the previous word affects the probability of the next word. A Trigram model assumes that the previous two words affect the probability of the next word. The more general case is of an n-gram model, in which it is assumed that the previous n-1 words have an effect on the probability of the next word. The reason bigram and trigram methods are used is because to calculate the probability of the string as a whole, usually works out to be unreasonable due to sparse data.

0.6

 
Markov processes were developed by Andrei A. Markov, with the purpose of modeling the letter sequence in works of Russian literature (Markov 1913). [9] A Markov Chain is a special case of a finite state automaton, where there are probabilities associated with each arc. The numbers associated with all of the arcs leaving a state must sum to one. A FSA can be thought of as either an acceptor or a generator, and so can a Markov Chain. [4] Here is a simple example of what a Markov Chain might look like.

0.4

 

0.3

 

h

 

p

 
                                                     

0.3

 

0.6

 

1.0

 

 a

 
                                           

1.0

 

 t

 

i

 

0.4

 
                                                                 

 e

 

0.4

 

                                                                           

 

Start                                                                 Example taken from [9]

 

To calculate the probability of a given string, for example P(t,i,p), just multiply the probabilities on the arcs.

 

So, the probability of the sequence (t,i,p) according to the above model, would be:

P(t,i,p)        = P(X1=t)P(X2=i|X1=t)P(X3=p|X2=i)

                  =1.0 x 0.3 x 0.6

= 0.18

Formally a HMM is defined as a four-tuple <s1, S, W, E> where S is a set of states, s1ÎS is the initial state of the model, W is a set of output symbols and E is a set of edges or transitions. These transitions are in turn four-tuples, <si, sj, wk, p>, where si is the state the transition starts from, sj is the state the transition goes to, wk is an output symbol and p is the probability of taking the transition. [4] Manning and Schütze [9] describe a HMM as a five-tuple (S, K, P, A, B), where S and K are the set of states and the output alphabet, and P, A and B are the probabilities for the initial state, state transitions, and symbol emissions respectively. Markov models assume that the only information affecting the probability of an output, or of the next state, is the prior state. This is called the Markov assumption. A Hidden Markov Model is so called because from the input, we cannot deduce the state sequence followed by a HMM, this sequence is hidden.

 

3.      Using HMMs

Two of the most common uses of HMMs are estimating the probability of a string and finding the most likely path through a HMM given an input. The second situation is often used to estimate Part-of-Speech tags.


Sparse data often causes a problem for estimating the probability of an English sentence, therefore a “smoothed” trigram model is preferred. A smoothed model is needed to deal with zero probabilities for less common trigrams. The most efficient method of smoothing is to add weighted terms for unigrams and bigrams into the probability assignment, giving the following equation:

[4]                                                                                                         l1,l2 and l3 must sum to 1

 

The Viterbi Algorithm

The Viterbi Algorithm is used for finding the most probable path through a HMM given a particular output sequence. It can be efficiently implemented using a trellis. [6] A trellis, sometimes called a lattice, is a square array. In the case of the Viterbi Algorithm, this trellis contains a square array of state versus time, and the probabilities for being in each state at each time in terms of the probabilities for being in each state at the preceding time instant [9]

Text Box: Given word sequence w1…wT, lexical categories L1,...Ln, lexical probabilities P (wt|Li), and bigram probabilities P(Li|Lj), find the most likely sequence of lexical categories C1,...CT for the word sequence.
Initialisation Step:
For i = 1 to N do
	SEQSCORE(i,1) = P(w1|Li)* P(Li|Ć)
	BACKPTR(i,1) = 0

Iteration Step:
For t=2 to T
	For i= 1 to N
		SEQSCORE(i,t)= MAXj=1,N (SEQSCORE(j,t-1)* P(Li|Lj)) * P(wt|Li)
		BACKPTR(i,t) = index of j that gave the max above

Sequence Identification Step:
C(T) = i that maximises SEQSCORE(i,t)
For i = T-1 to 1 do
	C(i) = BACKPTR(C(i+1),i+1)
[2]

 


Example calculation using the Viterbi Algorithm taken from Chapter 7 of “Natural Language Understanding” [2]

We have a table of words, used to make the ambiguous string “Flies like a flower”. The figures are taken from an artificially generated corpus containing 300 sentences and which only has words in four categories: N, V, ART, and P. [2] The table also contains probabilities P (Word|Category) which represents the probability that Word is of type Category in the corpus.


 


Word

N

PROB(Word|N)

V

PROB(

Word|V)

ART

PROB(

Word|ART)

P

PROB(Word|P)

TOTAL

Flies

21

0.025

23

0.076

0

0

0

0

44

like

10

0.012

30

0.1

0

0

21

0.068

55

a

1

0.001

0

0

201

0.36

0

0

202

flower

53

0.064

15

0.05

0

0

0

0

68

others

748

232

357

286

1629

total

833

300

558

307

1998

 

We also need a table of bigram probabilities from the corpus. The bigrams indicate the estimated probability that  the first category will occur, given that the word is preceded by a word with the second category. Any entry not found in the table is assumed to be 0.0001 [2]

 

Bigram

Estimate

P(ART|Ć)

0.71

P(N|Ć)

0.29

P(N|ART)

1

P(V|N)

0.43

P(N|N)

0.13

P(P|N)

0.44

P(N|V)

0.35

P(ART|V)

0.65

P(ART|P)

0.74

P(N|P)

0.26

 

The algorithm calculates the probabilities of the best sequence leading to each category at each position using an N*T array, where N is the number of lexical categories and T is the number of words in the string. In this example, this is a 4x4 array (trellis). The first step in the algorithm is to initialise the trellis by calculating the probability P(Cati |Ć) * P(W1|Cati) for each category. The following steps extend the trellis one word at a time, keeping track of the best sequence at each stage. Once all the words have been processed, the path that led to the best probability can be read from the BACKPTR array. [2]

 

The trellis after the first step of the Viterbi Algorithm looks something like: