ACL3 Project
Probabilistic Context-Free Parsing
Aoife Cahill:
97093246
Supervisors: Josef van Genabith
Fintan Costello
Probabilistic Context-Free Parsing
1.
Introduction
In the space of the last ten years, statistical
methods have gone from being virtually unknown in computational linguistics to
being a fundamental given. [1] The use of statistical methods has greatly
furthered progress in many areas of language processing. Disambiguation,
database classification of documents, speech recognition and grammar learning
are just some of the areas that statistics has helped. We now have several
large electronically available corpora, from which we can extract the
statistical information that can help us understand language. These corpora are
usually annotated and we can use these annotations to our advantage. Due to the
sheer volume of text and data in these corpora, we can assume certain
regularities of a language and try to describe the features of a language by
observing frequent characteristics. Statistics is the ideal tool for analysing
these frequently occurring phenomena.
One reason statistics holds such high
esteem among computational linguists is because the language being analysed has
actually been written or spoken by natives of the language, and does not rely
on fabricated examples produced by linguists. It is "real data" as
spoken and written by natives. This would suggest that by using statistics that
we get a more accurate view of a language. Up until now there have been problems
with either over-generation or under-generation. A more robust method of
language processing was needed. Statistics offers this robustness, since it
allows for a best guess in case of uncertainty. [3] Of course, as with most
theories, there are disadvantages to using statistics. Sparse data is probably
the main problem that computational linguists face when trying to use
statistics to describe language. The results we achieve from analysing corpora
depend heavily on the type of data found in the training corpus and the
following test corpus. There is no perfect data collection, nor any perfect
test suite. [4] Therefore when we use statistics, we will always get only an
approximation of the language.
I intend to show how statistics are used in studying language, in particular probabilistic context free grammars, and how these grammars can be used to produce statistical parsers. I would also like to show the advantages and disadvantages of using statistics in parsing, and to explore some of the methods to optimise parsing methods by using statistics. I will extract a probabilistic context-free grammar from a parsed version of the Penn Treebank using Perl. I will then discuss possible methods of using this grammar in a parser.
2.
Hidden Markov
Models
Hidden Markov Models were one of the first
implementations of statistical methods for linguistics. The most common use of
HMMs today is in speech processing. The two most common forms of calculating
probabilities of sentences are the bigram and trigram models. A Bigram model is
one in which it is assumed that the previous word affects the probability of
the next word. A Trigram model assumes that the previous two words affect the
probability of the next word. The more general case is of an n-gram model, in
which it is assumed that the previous n-1 words have an effect on the
probability of the next word. The reason bigram and trigram methods are used is
because to calculate the probability of the string as a whole, usually works
out to be unreasonable due to sparse data.
0.6
Markov processes were developed by Andrei A.
Markov, with the purpose of modeling the letter sequence in works of Russian
literature (Markov 1913). [9] A Markov Chain is a special case of a finite
state automaton, where there are probabilities associated with each arc. The
numbers associated with all of the arcs leaving a state must sum to one. A FSA
can be thought of as either an acceptor or a generator, and so can a Markov
Chain. [4] Here is a simple example of what a Markov Chain might look like.
0.4 0.3
![]()
![]()
![]()
![]()
![]()
h p

![]()
![]()
![]()
0.3 0.6 1.0 a
1.0 t i 0.4
![]()
![]()
![]()
e 0.4
![]()
![]()
![]()
Start Example taken from [9]
To calculate the probability of a given string,
for example P(t,i,p), just multiply the probabilities on the arcs.

So, the probability of the sequence (t,i,p) according
to the above model, would be:
P(t,i,p) =
P(X1=t)P(X2=i|X1=t)P(X3=p|X2=i)
=1.0
x 0.3 x 0.6
= 0.18
Formally a HMM is defined as a four-tuple <s1,
S, W, E> where S is a set of states, s1ÎS is
the initial state of the model, W is a set of output symbols and E is a set of
edges or transitions. These transitions are in turn four-tuples, <si,
sj, wk, p>, where si is the state the
transition starts from, sj is the state the transition goes to, wk
is an output symbol and p is the probability of taking the transition. [4]
Manning and Schütze [9] describe a HMM
as a five-tuple (S, K, P, A, B), where S and K are the set of states and the
output alphabet, and P, A and B are the probabilities for the initial state,
state transitions, and symbol emissions respectively. Markov models assume that
the only information affecting the probability of an output, or of the next
state, is the prior state. This is called the Markov assumption. A Hidden
Markov Model is so called because from the input, we cannot deduce the state
sequence followed by a HMM, this sequence is hidden.
3.
Using HMMs
Two of the most common uses of HMMs are
estimating the probability of a string and finding the most likely path through
a HMM given an input. The second situation is often used to estimate
Part-of-Speech tags.
![]()
Sparse data often causes a problem for estimating the probability of an
English sentence, therefore a “smoothed” trigram model is preferred. A smoothed
model is needed to deal with zero probabilities for less common trigrams. The
most efficient method of smoothing is to add weighted terms for unigrams and
bigrams into the probability assignment, giving the following equation:
[4] l1,l2 and
l3
must sum to 1
The
Viterbi Algorithm
The Viterbi Algorithm is used for finding the
most probable path through a HMM given a particular output sequence. It can be
efficiently implemented using a trellis. [6] A trellis, sometimes called a
lattice, is a square array. In the case of the Viterbi Algorithm, this trellis
contains a square array of state versus time, and the probabilities for being
in each state at each time in terms of the probabilities for being in each
state at the preceding time instant [9]
![Text Box: Given word sequence w1…wT, lexical categories L1,...Ln, lexical probabilities P (wt|Li), and bigram probabilities P(Li|Lj), find the most likely sequence of lexical categories C1,...CT for the word sequence.
Initialisation Step:
For i = 1 to N do
SEQSCORE(i,1) = P(w1|Li)* P(Li|Ć)
BACKPTR(i,1) = 0
Iteration Step:
For t=2 to T
For i= 1 to N
SEQSCORE(i,t)= MAXj=1,N (SEQSCORE(j,t-1)* P(Li|Lj)) * P(wt|Li)
BACKPTR(i,t) = index of j that gave the max above
Sequence Identification Step:
C(T) = i that maximises SEQSCORE(i,t)
For i = T-1 to 1 do
C(i) = BACKPTR(C(i+1),i+1)
[2]](./PCFG_files/image022.gif)
Example calculation using the Viterbi Algorithm
taken from Chapter 7 of “Natural Language Understanding” [2]
We have a table of words, used to make the ambiguous string “Flies like a flower”. The figures are taken from an artificially generated corpus containing 300 sentences and which only has words in four categories: N, V, ART, and P. [2] The table also contains probabilities P (Word|Category) which represents the probability that Word is of type Category in the corpus.
Word |
N |
PROB(Word|N) |
V |
PROB( Word|V) |
ART |
PROB( Word|ART) |
P |
PROB(Word|P) |
TOTAL |
|
|
Flies |
21 |
0.025 |
23 |
0.076 |
0 |
0 |
0 |
0 |
44 |
|
|
like |
10 |
0.012 |
30 |
0.1 |
0 |
0 |
21 |
0.068 |
55 |
|
|
a |
1 |
0.001 |
0 |
0 |
201 |
0.36 |
0 |
0 |
202 |
|
|
flower |
53 |
0.064 |
15 |
0.05 |
0 |
0 |
0 |
0 |
68 |
|
|
others |
748 |
232 |
357 |
286 |
1629 |
|||||
|
total |
833 |
300 |
558 |
307 |
1998 |
|||||
We also need a table of bigram probabilities from the
corpus. The bigrams indicate the estimated probability that the first category will occur, given that
the word is preceded by a word with the second category. Any entry not found in
the table is assumed to be 0.0001 [2]
|
Bigram |
Estimate |
|
P(ART|Ć) |
0.71 |
|
|
0.29 |
|
P(N|ART) |
1 |
|
P(V|N) |
0.43 |
|
P(N|N) |
0.13 |
|
P(P|N) |
0.44 |
|
P(N|V) |
0.35 |
|
P(ART|V) |
0.65 |
|
P(ART|P) |
0.74 |
|
P(N|P) |
0.26 |
The algorithm calculates the probabilities of the best
sequence leading to each category at each position using an N*T array, where N
is the number of lexical categories and T is the number of words in the string.
In this example, this is a 4x4 array (trellis). The first step in the algorithm
is to initialise the trellis by calculating the probability P(Cati |Ć) * P(W1|Cati) for each category. The following
steps extend the trellis one word at a time, keeping track of the best sequence
at each stage. Once all the words have been processed, the path that led to the
best probability can be read from the BACKPTR array. [2]
The trellis after the first step of the Viterbi
Algorithm looks something like: