ACL3 Project
Probabilistic Context-Free Parsing
Aoife Cahill:
97093246
Supervisors: Josef van Genabith
Fintan Costello
Probabilistic Context-Free Parsing
1.
Introduction
In the space of the last ten years, statistical
methods have gone from being virtually unknown in computational linguistics to
being a fundamental given. [1] The use of statistical methods has greatly
furthered progress in many areas of language processing. Disambiguation,
database classification of documents, speech recognition and grammar learning
are just some of the areas that statistics has helped. We now have several
large electronically available corpora, from which we can extract the
statistical information that can help us understand language. These corpora are
usually annotated and we can use these annotations to our advantage. Due to the
sheer volume of text and data in these corpora, we can assume certain
regularities of a language and try to describe the features of a language by
observing frequent characteristics. Statistics is the ideal tool for analysing
these frequently occurring phenomena.
One reason statistics holds such high
esteem among computational linguists is because the language being analysed has
actually been written or spoken by natives of the language, and does not rely
on fabricated examples produced by linguists. It is "real data" as
spoken and written by natives. This would suggest that by using statistics that
we get a more accurate view of a language. Up until now there have been problems
with either over-generation or under-generation. A more robust method of
language processing was needed. Statistics offers this robustness, since it
allows for a best guess in case of uncertainty. [3] Of course, as with most
theories, there are disadvantages to using statistics. Sparse data is probably
the main problem that computational linguists face when trying to use
statistics to describe language. The results we achieve from analysing corpora
depend heavily on the type of data found in the training corpus and the
following test corpus. There is no perfect data collection, nor any perfect
test suite. [4] Therefore when we use statistics, we will always get only an
approximation of the language.
I intend to show how statistics are used in studying language, in particular probabilistic context free grammars, and how these grammars can be used to produce statistical parsers. I would also like to show the advantages and disadvantages of using statistics in parsing, and to explore some of the methods to optimise parsing methods by using statistics. I will extract a probabilistic context-free grammar from a parsed version of the Penn Treebank using Perl. I will then discuss possible methods of using this grammar in a parser.
2.
Hidden Markov
Models
Hidden Markov Models were one of the first
implementations of statistical methods for linguistics. The most common use of
HMMs today is in speech processing. The two most common forms of calculating
probabilities of sentences are the bigram and trigram models. A Bigram model is
one in which it is assumed that the previous word affects the probability of
the next word. A Trigram model assumes that the previous two words affect the
probability of the next word. The more general case is of an n-gram model, in
which it is assumed that the previous n-1 words have an effect on the
probability of the next word. The reason bigram and trigram methods are used is
because to calculate the probability of the string as a whole, usually works
out to be unreasonable due to sparse data.
0.6
Markov processes were developed by Andrei A.
Markov, with the purpose of modeling the letter sequence in works of Russian
literature (Markov 1913). [9] A Markov Chain is a special case of a finite
state automaton, where there are probabilities associated with each arc. The
numbers associated with all of the arcs leaving a state must sum to one. A FSA
can be thought of as either an acceptor or a generator, and so can a Markov
Chain. [4] Here is a simple example of what a Markov Chain might look like.
0.4 0.3
![]()
![]()
![]()
![]()
![]()
h p

![]()
![]()
![]()
0.3 0.6 1.0 a
1.0 t i 0.4
![]()
![]()
![]()
e 0.4
![]()
![]()
![]()
Start Example taken from [9]
To calculate the probability of a given string,
for example P(t,i,p), just multiply the probabilities on the arcs.

So, the probability of the sequence (t,i,p) according
to the above model, would be:
P(t,i,p) =
P(X1=t)P(X2=i|X1=t)P(X3=p|X2=i)
=1.0
x 0.3 x 0.6
= 0.18
Formally a HMM is defined as a four-tuple <s1,
S, W, E> where S is a set of states, s1ÎS is
the initial state of the model, W is a set of output symbols and E is a set of
edges or transitions. These transitions are in turn four-tuples, <si,
sj, wk, p>, where si is the state the
transition starts from, sj is the state the transition goes to, wk
is an output symbol and p is the probability of taking the transition. [4]
Manning and Schütze [9] describe a HMM
as a five-tuple (S, K, P, A, B), where S and K are the set of states and the
output alphabet, and P, A and B are the probabilities for the initial state,
state transitions, and symbol emissions respectively. Markov models assume that
the only information affecting the probability of an output, or of the next
state, is the prior state. This is called the Markov assumption. A Hidden
Markov Model is so called because from the input, we cannot deduce the state
sequence followed by a HMM, this sequence is hidden.
3.
Using HMMs
Two of the most common uses of HMMs are
estimating the probability of a string and finding the most likely path through
a HMM given an input. The second situation is often used to estimate
Part-of-Speech tags.
![]()
Sparse data often causes a problem for estimating the probability of an
English sentence, therefore a “smoothed” trigram model is preferred. A smoothed
model is needed to deal with zero probabilities for less common trigrams. The
most efficient method of smoothing is to add weighted terms for unigrams and
bigrams into the probability assignment, giving the following equation:
[4] l1,l2 and
l3
must sum to 1
The
Viterbi Algorithm
The Viterbi Algorithm is used for finding the
most probable path through a HMM given a particular output sequence. It can be
efficiently implemented using a trellis. [6] A trellis, sometimes called a
lattice, is a square array. In the case of the Viterbi Algorithm, this trellis
contains a square array of state versus time, and the probabilities for being
in each state at each time in terms of the probabilities for being in each
state at the preceding time instant [9]
![Text Box: Given word sequence w1…wT, lexical categories L1,...Ln, lexical probabilities P (wt|Li), and bigram probabilities P(Li|Lj), find the most likely sequence of lexical categories C1,...CT for the word sequence.
Initialisation Step:
For i = 1 to N do
SEQSCORE(i,1) = P(w1|Li)* P(Li|Æ)
BACKPTR(i,1) = 0
Iteration Step:
For t=2 to T
For i= 1 to N
SEQSCORE(i,t)= MAXj=1,N (SEQSCORE(j,t-1)* P(Li|Lj)) * P(wt|Li)
BACKPTR(i,t) = index of j that gave the max above
Sequence Identification Step:
C(T) = i that maximises SEQSCORE(i,t)
For i = T-1 to 1 do
C(i) = BACKPTR(C(i+1),i+1)
[2]](./PCFG_files/image022.gif)
Example calculation using the Viterbi Algorithm
taken from Chapter 7 of “Natural Language Understanding” [2]
We have a table of words, used to make the ambiguous string “Flies like a flower”. The figures are taken from an artificially generated corpus containing 300 sentences and which only has words in four categories: N, V, ART, and P. [2] The table also contains probabilities P (Word|Category) which represents the probability that Word is of type Category in the corpus.
Word |
N |
PROB(Word|N) |
V |
PROB( Word|V) |
ART |
PROB( Word|ART) |
P |
PROB(Word|P) |
TOTAL |
|
|
Flies |
21 |
0.025 |
23 |
0.076 |
0 |
0 |
0 |
0 |
44 |
|
|
like |
10 |
0.012 |
30 |
0.1 |
0 |
0 |
21 |
0.068 |
55 |
|
|
a |
1 |
0.001 |
0 |
0 |
201 |
0.36 |
0 |
0 |
202 |
|
|
flower |
53 |
0.064 |
15 |
0.05 |
0 |
0 |
0 |
0 |
68 |
|
|
others |
748 |
232 |
357 |
286 |
1629 |
|||||
|
total |
833 |
300 |
558 |
307 |
1998 |
|||||
We also need a table of bigram probabilities from the
corpus. The bigrams indicate the estimated probability that the first category will occur, given that
the word is preceded by a word with the second category. Any entry not found in
the table is assumed to be 0.0001 [2]
|
Bigram |
Estimate |
|
P(ART|Æ) |
0.71 |
|
|
0.29 |
|
P(N|ART) |
1 |
|
P(V|N) |
0.43 |
|
P(N|N) |
0.13 |
|
P(P|N) |
0.44 |
|
P(N|V) |
0.35 |
|
P(ART|V) |
0.65 |
|
P(ART|P) |
0.74 |
|
P(N|P) |
0.26 |
The algorithm calculates the probabilities of the best
sequence leading to each category at each position using an N*T array, where N
is the number of lexical categories and T is the number of words in the string.
In this example, this is a 4x4 array (trellis). The first step in the algorithm
is to initialise the trellis by calculating the probability P(Cati |Æ) * P(W1|Cati) for each category. The following
steps extend the trellis one word at a time, keeping track of the best sequence
at each stage. Once all the words have been processed, the path that led to the
best probability can be read from the BACKPTR array. [2]
The trellis after the first step of the Viterbi
Algorithm looks something like:
7.6*10-6 (0.0001*0.076) 0.00725
(0.025*0.29) 0 0


![]()
![]()
Step 2:
0.00031 MAX(7.6*10-6*0.0001, 0.00725*0.43) *0.1 1.13*10-5 MAX(7.6*10-6
*
0.35,
0.00725*0.13) *0.012 0.00022 MAX(7.6*10-6 0.0001, 0.00725*0.44)*0.068 0

a/V like/V 0
![]()
![]()
Step 3:
7.2*10-5 MAX(0.00031*0.65, 0.00022*0.74)*0.36 2.6*10-9


![]()
![]()
Step 4:
4.3*10-6 0 0

The Viterbi Algorithm
correctly gives the highest probability to the sequence Flies (N) like (V) a
(ART) flower (N). But this algorithm
will not always necessarily find the most intuitive sequence. By using a
trigram model instead of the bigram model in this example, accuracy can be
greatly improved. This algorithm is most successful, when the statistics
gathered are gathered from a large corpus, which is similar to the input to be
assessed. This is a frequent problem when analysing natural language with
statistics. When a method is developed it has to be tested and trained on a
particular corpus. The style and content of this corpus greatly influences how
the new method will perform with unseen data. Natural language in its very
nature is diverse in style and content. It is important to get a balance and a
broad range of language when training a model if you want the model to perform
well on unseen data.
Forward Probability, Backward Probability and the
Forward-Backward Algorithm
If ai(t) is the probability of producing w1,t-1
while ending up in state si, it is called the forward probability.
It is so called because it calculates the probability by moving forward in the
sequence. Backward probability calculates the probability by moving backwards
in the sequence. [4] By combining these two procedures we get the
forward-backward algorithm. This is also known as the Baum-Welch Algorithm, and
is an iterative hill-climbing algorithm for finding the local maximum of a
model. It is given a training sequence and adjusts the HMM parameter
probabilities (the probabilities on the arcs) to make this training sequence as
likely as possible.
4.
Parsing Algorithms
A parser uses a parsing algorithm, along with a
grammar and dictionary, to produce a phrase structure tree that corresponds to
a given sequence of words. There is a direct correspondence between the rules
of a grammar and the structure it assigns. [13] The goal of parsing is to
describe the most efficient procedures and algorithms to determine whether a
given sentence is a valid sentence according to a given grammar, and if so,
what its structure is. The syntax of natural languages is best represented in a
hierarchical, recursive structure, or tree. This is non-linear, and therefore
HMMs are not suited to finding the structure of a sentence.
There have been numerous parsing algorithms
developed, and still more are being developed. A simple top-down parser will
parse a sentence from top to bottom. This algorithm tries to predict the end
string from the given grammar. This can be non-deterministic, as each time it
chooses the wrong path, it has to backtrack to where it made the wrong
decision. One problem associated with top-down depth-first left to right
parsers is that they sometimes get stuck in a loop, if one or more of the
grammar rules are left recursive. Another is that it is inefficient and has a
worst case exponential run-time. Bottom-up parsers start with the input string
and try to reduce the string into its grammar rules. A disadvantage of this
type of parsing is that if the grammar has empty productions (e) the parser can also get
stuck in a loop.
An active chart parser can take the advantages
of both types of parsing algorithms, and combine them to make an efficient
parser. These parsers have a type of
book keeping mechanism called charts, which store partial analyses preventing
the need to backtrack to the beginning when something goes wrong. A chart can
be visualised as a network of vertices representing points in the sentence,
linked by edges representing constituents. [13] A Chart Parser has three main
constituents, a key list, a chart and a set of edges. A chart is a set of chart
entries, each of which consists of the name of a terminal or non-terminal
symbol, the starting point of entry and the entry length. The key list is a
pushdown stack of chart entries that are waiting to be added into the chart.
The edges are rules that can be applied to chart entries to build them up into
larger entries. [4]

The CKY Algorithm
![Text Box: for j := i+1 to k-1 do
if B Î tij and C Î tjk then
nl := Gen(B,i,j)
nr := Gen(C,j,k)
l := l È{<nl,nr>}
n := <A,l >
pik := pik È {n}
return n
Recognise (x1...xn)
for k := 1 to n do
tk-1,k := {A|A à xk Î P}
for i := k – 2 downto 0 do
tik := {}
for j := i+1 to k-1 do
for all A à B C Î P do
if B Î tij and C Î tjk then
tik := tik {A}
if S Î t0n then accept, otherwise reject
[11]](./PCFG_files/image043.gif)
Cocke, Kasami and Younger developed a chart based
bottom-up parser for grammars in Chomsky-Normal form. Chomsky-Normal form is a
restricted type of CFG where all of the rewrite rules must be of the form: Aà a or AàB C,
where {A,B,C} are non-terminals and a is a terminal symbol. The chart is filled from left to right and from
bottom to top.
There are two phases to this algorithm. The first
stage is the recognition phase which constructs a table denoting the non-terminals
that derive certain substrings of the language. This ultimately allows us to
see whether the input string can be derived. The second stage uses this table
and the grammar to construct all possible derivations of the string. [ The
algorithm first tries to derive strings of the shortest length and then works
its way up through the chart. [5]
For example, given the following grammar (in CNF), to
parse the string "nvdnpdn":
|
S à NP VP |
VP à V NP |
N1 à a |
P à p |
|
NP à NP PP |
VP2 à V NP |
NP à a |
PP à p |
|
NP à D N1 |
S à V NP |
A à a |
NP à p |
|
NP à A N1 |
VP à VP PP |
D à d |
V à v |
|
N1 à A N1 |
S à VP PP |
NP à d |
VP à v |
|
NP à P NP |
VP à VP2 NP |
N1 à n |
S à v |
|
PP à P NP |
S à VP2 NP |
NP à n |
|
The first step is to fill the first row of the
chart with possible non-terminals which start with the terminal symbol in the
input string. The next step is a recursive step. For each in the nth
row there are n-1 combinations of pairs of non-terminals that can be used to
try and add a non-terminal to the cell. For example in the second row, the first
cell has only one pair of cells to check to see if by taking a non-terminal
from each a new non-terminal can be formed. In row 4, the first cell ([4,4])
has three possible pairs to choose from. These pairs are: ([1,1],[3,4]),
([2,2],[2,4]) and ([3,3],[1,4]). All the possible pair combinations of
non-terminals which form new non-terminals are added to this cell. The process
continues until all the cells have been filled. The input string has been
successfully parsed if the start symbol, in this case S, appears in the last
cell. [11]
|
|
1 n |
2 v |
3 d |
4 n |
5 p |
6 d |
7 n |
|
|
N1,NP |
V,VP,S |
D, NP |
N1, NP |
P, PP,NP |
D, NP |
N1,NP |
|
2 |
|
S |
S,VP,VP2 |
NP |
NP |
NP,PP |
NP |
|
3 |
|
|
S |
S,VP,VP2 |
NP |
NP |
PP,NP |
|
4 |
|
|
|
S |
VP,VP2,S |
NP |
NP |
|
5 |
|
|
|
|
S |
VP,VP2,S |
NP |
|
6 |
|
|
|
|
|
S |
VP,VP2,S |
|
7 |
|
|
|
|
|
|
S |
This algorithm is not so suitable for parsing because
it demands that the grammar be in CNF, which is rarely the case in practice,
although it has been proved that all grammars can be converted into CNF. The
main problem with this however is that this often leads to an exponentially
large number of rules. If this algorithm is extended to deal with normal CFGs
it becomes much more inefficient and the runtime increases exponentially. [11]
The Earley Algorithm
An Earley parser is essentially a generator that
builds left-most derivations of string, using a given set of context-free
productions. The parser keeps a set of
states for each position in the input, describing all pending derivations.
These state sets form the Earley chart. A state is of the form i: kXà l·m where X is a non-terminal, l and m are
strings of non-terminals and/or terminals and i and k are indices into the input string. The current position
is i, X was expanded starting at position k and the expansion of X proceeded
using the production X à lm, and has expanded the right
hand side up to the position indicated by the dot. A state with a dot to the
right hand side of the entire right hand side is called a complete state. There
are three types of operations that an Earley parser will carry out, prediction,
scanning and completion. A prediction corresponds to a potential expansion of a
non-terminal in a left-most derivation. Scanning moves the dot over the current
symbol, if it finds a match in the current input. Completion corresponds to the
end of a non-terminal expansion started by a prediction step. [12]
5.
Probabilistic Context-Free Parsing
Context-free grammars are widely used as models of natural language syntax. In their probabilistic version, which defines a language as a probability over strings, they have been used in a variety of applications. [12]
The
simplest way to gather statistical information about a grammar is to count the
number of times each rule is used in a corpus containing parsed sentences and
use this to estimate the probability of each rule being used. [2] The simplest probabilistic model for recursive embedding is a Probabilistic (or Stochastic) Context-Free Grammar. This is
basically a CFG with probabilities associated with each rule, indicating how
probable a rewrite rule is. The algorithms used for PCFGs are a natural
development from the algorithms employed with HMMs. [9]
Formally
Charniak [4] describes a probabilistic context-free grammar as a four-tuple
<W, N, N1, R> where W is a set of terminal symbols {w1,…ww}, N is a set of
non-terminal symbols {N1,…Nn}, N1 is the
starting symbol, and R is a set of rules, each of which is of the form Ni à zj where
zj is a
string of terminals
and non-terminals. Each rule has the probability P (Ni à zj)
A non-terminal dominates
constituents if in a derivation, all of the constituents are derived (directly
or indirectly) from the non-terminal.
dominates

![]()
![]()
To find the probability of a tree in a PCFG
model, just multiply the probabilities of the rules that built its local
sub-trees. [9] The probability of a string is defined as the sum of the
probabilities of the parse trees that have this string as a yield. The
probability of a parse tree is defined as the probability of the corresponding
leftmost derivation. [6] Since the number of parses grows exponentially with
the length of a sentence, an algorithm similar to the Viterbi algorithm can be
used to find the most probable parse for a sentence.
There are many advantages to using PCFGs over
normal CFGs to determine the structure of a sentence. As grammars get larger
and larger, they become more and more ambiguous. A PCFG will give an indication
of the most likely parse. It can also be programmed to give the n most probable
structures, which can be useful when sentences are ambiguous to the reader.
Often CFGs return structures, which intuitively a human would never have
chosen. For example, Martin et al. (1987) report their system giving 455 parses
for the sentence: “List the sales of the products produced in 1973 with the
products produced in 1972”. [3,8] While it is obvious that there is a certain
degree of ambiguity due to the PP attachments, a human would more than likely
not have found many of the computer-generated parses, simply because we can
rule out certain combinations of words purely on the basis of our common sense.
This indicates how a probabilistic grammar would be useful in simply suggesting
the more probable parses, instead of a human having to go through each of the
455 parses to determine whether they were reasonable or not.
PCFGs do not, however, take context or lexical
co-occurrence into consideration. This means that sometimes some of the parses
suggested by a PCFG may not be the best, because it will have based its
plausibility purely on syntactic structures.
Charniak [4] illustrates this with an example of PP attachment. PPs,
statistically, attach slightly more often to NPs than to VPs. However in the
following two sentences, it is more the meaning of the words than the structure
of sentence, that causes the PPs to be attached differently.
·
Alice bought a plant with Mary.
·
Alice bought a plant with yellow leaves.
PCFGs
tend to be robust [9]. They produce a model of a language based on real data,
and therefore do not have to worry about things like grammatical mistakes,
which occur in real-life situations. Although PCFGs have many advantages, the
main, and I believe critical, disadvantage is that context is not taken into
account at all. In fact a trigram model of a language would probably achieve
better results [4], even though it takes no account of internal structures in
the language. Manning and Schütze [9]
also point out that even structurally a PCFG is also deficient. A PCFG takes no
account of where a phrase appears in a sentence. The expanding of a subject NP
is often different to the expanding of an object NP. Many people have noted
this disadvantage and many attempts have been made to combine other theories
and PCFGs to achieve better results.
The Ƥearl and Ƥicky Projects
Ƥearl is a probabilistic Earley-style parser. It is a bottom-up chart
parser with Earley-type top-down prediction which pursues the highest-scoring
theory in the chart, where the score of a theory represents the extent to which
the context of the sentence predicts that interpretation [7] Ƥearl uses Church’s part-of-speech assignment trigram model, a simple
probabilistic unknown word model, and a conditional probability model for
grammar rules based on part-of-speech trigrams and parent rules. Ƥearl avoids the problem of not having any context information, by using
a context sensitive conditional probability CFG, where context is partly
influenced by the part-of-speech sequences. Ƥearl also uses the geometric mean to score theories, rather than just
multiply probabilities. Magerman and Mitchell [7] claim that by simply
multiplying probabilities, you are over-penalising infrequent constructs. Ƥearl also has functions to deal with unknown words and idioms. If it
cannot successfully parse a sentence, it will attempt to create a partial parse
of the sentence.
This
project is one example of how a statistical grammar can be combined with more
successful n-gram methods to parse. It parsed with an accuracy of 88% (on a
test suite of 40 sentences) which is considerably better than what a parser
using just a PCFG can do. Its successor Ƥicky also had similar results.
Ƥicky is a probabilistic
agenda-based chart parsing algorithm which uses a technique called
probabilistic prediction to predict which grammar rules are likely to lead to
an acceptable parse of the input. [8] The parser limits the edges that can be
added to the chart, to only those predicted by its probabilistic prediction
technique.
Ƥicky uses the CKY
chart-parsing algorithm. This algorithm parses efficiently and robustly in
polynomial time. Although the Earley algorithm is more efficient than the CKY
algorithm, it is not as robust when parsing ungrammatical sentences.
6.
Conclusion
Statistical methods for natural language
processing are becoming more and more widely accepted in the computational
linguistics field, because of their practicality for computation. However
Chomsky, the most famous and well-respected linguist, has argued against the
use of statistics in linguistics. He
argued that probability theory is inappropriate for formalising the notion of
grammaticality, since computing the probability of sentences from a corpus,
assigns the same low probability to all unattested sentences, grammatical and
ungrammatical ones alike. (Chomsky 1957)[9] However, I believe that statistical
methods used in NLP today are appropriate. Methods developed always try to find
a suitable probability for unseen events, this is indeed one of the challenges
of developing a successful statistical method for analysing natural language.
Abney [1] argues that many of Chomsky’s arguments against statistical models,
HMMs in particular, were against FSA, rather than the fact that they were
stochastic. I believe that statistics can provide a good insight to actual
occurrences in natural language. Due to the fact that the statistics we use are
taken from corpora of language spoken or written by native speakers, I believe
that this information can often be more useful than a theoretical attempt to
represent for example the grammar rules of a language. By adding probabilities to such rules, we
also get a better idea of which rules are more common, and which rules may be
rare. One often finds quite rare and sometimes extreme examples when linguists
are describing some phenomena of a language. While these examples might be
interesting, I think they do not give a true picture of a language, since they
give no indication of how rare or typical such an utterance might be. This is a
difference between linguistic competence and linguistic performance. I believe
that when developing an NLP system, that the emphasis should be on linguistic
performance, since human speakers of a language will react quicker to
utterances they are likely to say or have heard themselves.
In my opinion PCFGs give a much better overall
view of a language. However, on their
own, they cannot predict the structure of a sentence accurately enough. This is
simply because, as their name suggests, they do not take context into
consideration. This implies that they have to be combined with other methods of
analysis to be effective. Charniak (1996) investigates how vital lexical
information is to parsing. He induced a maximum likelihood PCFG from the Penn
Treebank, made some small modifications and used this to parse some unseen
sentences. His results were surprisingly good. Manning and Schütze [9] argue that this is maybe because many
parsing decisions are mundane after all and can be handled by an unlexicalised
PCFG.
I believe that the efficiency and accuracy can
be improved by combining a simple probabilistic parser with other methods,
similar to those outlined in the Ƥearl
and Ƥicky Projects. I have
written a short Perl program to extract a PCFG from the Penn Treebank. This
program outputs a list of grammar rules found, a list of lexical items found,
and a list of the probabilities associated with each rule found. Many of the
rules are quite flat, and do not contain all the parts of speech. For my third
year implementation, I attempted to assign parts of speech to some of the words
by assuming that an untagged word in a phrase was the head of the phrase if
there had been no head found already, but this is not always correct. Next year
I intend to extract the correct part of speech tags for all untagged words in
the parsed version, from the tagged version. This will increase the accuracy of
the rules produced by the program. I intend to use this PCFG to write a
probabilistic parser. I will first write a simple probabilistic parser using
only this PCFG, and then try to develop a more complex algorithm taking more
lexical information into account, and hopefully show that by using the lexical
information the accuracy and efficiency of the parser is increased.
A sample output of the program is:
( (S
(NP-SBJ (NP Pierre Vinken)
,
(ADJP (NP 61 years)
old)
,)
(VP will
(VP join
(NP the board)
(PP-CLR as
(NP a nonexecutive
director))
(NP-TMP Nov. 29)))
.))
Output
Lexicon |
Rules |
Probabilistic Rules |
|
ADJ->old:
1 |
ADJP->NP
ADJ: 1 |
ADJP->NP
ADJ: 1 |
|
NP->61
years: 1 |
NP-SBJ->NP
ADJP : 1 |
NP-SBJ->NP
ADJP : 1 |
|
NP->Pierre
Vinken: 1 |
PP-CLR->P
NP : 1 |
PP-CLR->P
NP: 1 |
|
NP->a
nonexecutive director: 1 |
S->NP-SBJ
VP : 1 |
S->NP-SBJ
VP: 1 |
|
NP->the
board: 1 |
VP->V
NP PP-CLR NP-TMP : 1 |
VP->V
NP PP-CLR NP-TMP: 0.5 |
|
NP-TMP->Nov.
29: 1 |
VP->V
VP : 1 |
VP->V
VP : 0.5 |
|
P->as:
1 |
|
|
|
V->join:
1 |
|
|
|
V->will:
1 |
|
|
Bibliography
[1] Abney, Steven (1996) Statistical
Methods and Linguistics: In Judith Klavans and Philip Resnik, eds., The Balancing Act. MIT Press, Cambridge,
MA.
[2] Allen, James (1995) Natural
Language Understanding Benjamin/Cummings Publishing Company Inc.
[3] Bod, Rens (1995) Enriching
Linguistics with Statistics ILLC Dissertation Series 1995-14
[4] Charniak, Eugene (1993) Statistical
Language Learning: MIT Press,
Cambridge, MA
[5] Grune, Dick and Jacobs, Ceriel J.H (1990) Parsing Techniques a
Practical Guide Ellis Horwood Limited
[6] Krenn, Brigitte and Samuelssonm Christer (1997) The Linguist’s Guide to Statistics http://www.coli.uni-sb.de/~krenn/edu.html Last visited: 23rd October 2000
[7] Magerman, David M. and Marcus, Mitchell P. (1991)Ƥearl: A Probabilistic Chart
Parser in Proceedings of the European ACL Conference,
March 1991. Berlin, Germany
[8] Magerman, David M. and Weir, Carl (1992) Efficiency, Robustness and
Accuracy in Ƥicky Chart Parsing in Proceedings of the 30th Annual Meeting of the Association
for Computational Linguistics ,June 1992. Delaware, USA
[9] Manning, Christopher D and Schütze, Hinrich (1999) Foundations of Statistical
Natural Language Processing MIT Press, Cambridge, MA
[10] Martin W., Church,K and Patil, R (1987) Preliminary Analysis of a
Breadth First Parsing Algorithm: Theoretical and Experimental Results” in
L. Bolc (ed) Natural Language Parsing Systems, Springex Verlag, Berlin
[11] Schmid, Helmut
(1999) taken from lecture notes of the course Parsing I which took place in the
winter semester 1999/2000 at Universität Stuttgart, Germany
[12] Stolcke, Andreas (1993) An
Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix
Probabilities Berkley, CA (revised 1994)
[13] Winograd, Terry (1983) Language
as a Cognitive Process Volume 1: Syntax Adison-Wesley Publishing Company
#!/usr/bin/perl
###############################################################
# #
# A Project to extract a Probabilistic Context Free Grammar #
# from the Penn Treebank #
# written as part of a third year project for DCU #
# Aoife Cahill #
# August 2000 #
# #
# usage: pcfg.pl directory_name # # where directory_name is the name of the directory #
# in which the parsed files are kept #
# #
###############################################################
#read in the directory in which all the parsed files are kept
$suffix = ".prd";
%lexicon =();
%rules = ();
#read in the directory in which all the parsed files are kept
$directory = shift;
@files = get_files($directory);
foreach $file(@files){
parse ($file);
}
open(OUTLEX, ">lexicon") or die "Error: $!";
open(OUTRULE, ">rules") or die "Error: $!";
open(OUTPROB, ">probs") or die "Error: $!";
foreach $key (sort keys(%lexicon)){
print OUTLEX "$key: $lexicon{$key}\n";
}
foreach $key(sort keys(%rules)){
print OUTRULE "$key: $rules{$key}\n";
}
get_probs("rules");
foreach $key(sort keys(%probabilities)){
print OUTPROB "$key: $probabilities{$key}\n";
}
close(OUTLEX) or die "Error : $!";
close(OUTRULE) or die "Error : $!";
close(OUTPROB) or die "Error : $!";
#####################################
# Sub-routine get_files($) #
# #
# gets a list of files that provide #
# the data for the pcfg #
#####################################
sub get_files($){
my $dir = shift;
my @listing = glob "$dir/*";
if (@listing > 0){
foreach $file(@listing){
if(-f $file){
#only add .prd files to list
if ($file =~ /\Q${suffix}\E$/){
push @parsed_files, $file;
}
}
elsif (-d $file){
get_files($file);
}
else {
#not interested in directory element
}
}
}
return @parsed_files;
}
############################################
# Sub-routine parse($) #
# #
# Extracts grammar rules and their #
# frequencies from each data file #
############################################
sub parse($){
my @string = ();
$file = shift;
print "processing file : $file\n";
open(FILE,$file) or die "Error : $!";
while (<FILE>){
chomp;
if ($_ =~ /^\s/){
s/^\s+//;
my $str = clean($_);
push @string, $str;
}
elsif (@string > 0) {
process_string(@string);
@string = ();
my $str = clean($_);
push @string, $str;
}
else {
my $str = clean($_);
push @string, $str;
}
}
#process the last sentence
process_string(@string);
}
###################################
# Sub-routine process_string($) #
# #
# puts the sentence into one #
# string variable #
###################################
sub process_string($){
my @string = @_;
my $sentence = "";
for($i=0;$i<@string;$i++){
$sentence = $sentence.$string[$i];
}
$sentence = substr($sentence,1,-1); #get rid of extra ()
parse_string($sentence);
}
###################################
# Sub-Routine parse_string($) #
# #
# gets the grammar rules from one #
# sentence and increments freq. #
###################################
sub parse_string($){
my $string = shift;
my @categories = @_;
my $num = @categories;
if ($string =~/^[0\@\*]/){
#this is a form of null element, ignore it
#starts with a 0 or @
pop @categories;
my $rest = $';
parse_string($rest,@categories);
}
elsif ($string =~ /^\(([A-Z]+([-|\|][A-Z]+)*)([-=]?[0-9]+)? ?/){
# new category... add category to previous rule
# and start new rule for this category...
#CAT(
my $category = $1;
if ($num > 0){
$categories[$num-1] = "$categories[$num-1]"."$category"." ";
}
my $new_rule = "$category->";
push @categories, $new_rule;
parse_string($',@categories);
}
elsif ($string =~ /^([A-Za-z0-9\$\£\%\#]+[A-Za-z0-9\- \.\&\%\$\/\#\\]*)\)/){
#word)
my $value = $1;
my $rest = $';
my $category = $categories [$num-1];
if ($category =~ /([A-Z\-]+)P\-\>[A-Z\- ]+$/){
#not all words are given categories.... category assumed
#to be that of mother
#e.g. (ADJP(NP 61 years)old)
#old assumed to be an ADJ
my $type = $1;
my $rule = "$type->$value";
add_to("lexicon",$rule);
$category = $categories[$num-1].$type ;
pop @categories;
add_to("rules",$category);
}
else{
if ($num >0){
$categories[$num-1] = $categories[$num-1].$value;
my $rule = pop @categories;
add_to("lexicon",$rule);
}
}
parse_string($rest,@categories);
}
elsif ($string =~ /^([A-Za-z0-9\$\#]+[A-Za-z0-9\-\.\&\%\#\$\/\\\s]*)([A-Za-z0-9 ]+)?\(/){
#if a word appears with no tag, it is assumed to have the
#type of the main category.
#eg. (VP will(VP join
#will and join are taken to be verbs
#word(
my $value = $1;
if ($num > 0){
my $category = $categories[$num-1];
if ($category =~ /([A-Z]+)P(-[A-Z]+)?->/){
my $type = $1;
my $rhs = $';
if ($rhs =~ /\b$type[P]?/){
$categories [$num-1] = $categories[$num-1].$value." ";
}
else{
my $rule = "$type->$value";
add_to("lexicon",$rule);
$categories[$num-1] = "$categories[$num-1]"."$type"." ";
}
}
else {
$categories[$num-1] = $categories[$num-1].$value." ";
}
}
my $rest = "($'";
parse_string($rest,@categories);
}
elsif ($string =~ /^\)/){
if ($num > 0){
my $rule = pop @categories;
#don't add empty productions
if ($rule){
if ($rule =~ /->[A-Za-z\$\#0-9]/){
add_to("rules",$rule);
}
}
}
parse_string($',@categories);
}
elsif ($string eq ""){
# print "end of sentence\n";
# print "@categories\n";
}
elsif ($string=~ /^[\.\s\-\&=]/){
parse_string($',@categories);
}
elsif ($string =~ /^\( \(/){
my $rest = "($'";
parse_string($rest,@categories);
}
else {
print "$string";
print "oops.. something i haven't thought of\n";
}
}
##################################
# Sub-Routine add_to($,$) #
# #
# adds a rule to the lexicon #
# or list of rules or #
# increments its frequency count #
##################################
sub add_to ($,$){
my $hash = shift;
my $rule = shift;
$rule =~ s/\s+/ /g;
if (exists $$hash{$rule}){
$$hash{$rule}++;
}
else{
$$hash{$rule} = 1;
}
}
#####################################
# Sub-Routine clean($) #
# #
# takes out all punctuation and #
# null elements #
#####################################
sub clean($){
my $string = shift;
$string =~ s/[\'\`\?\!\"\;\:\,]//g;
$String =~ s/\s+/ /g;
$string =~ s/-[A-Z]+-//g;
$string =~ s/\*[A-Z]+\*(-[0-9]+)?//g;
$string =~ s/\*-[0-9]+//g;
$string =~ s/[A-Z]+\*//g;
return $string;
}
#####################################
# Sub-Routine get_probs($) #
# #
# assigns probabilities to each #
# rule that was found #
#####################################
sub get_probs($){
my $hash = shift;
%totals = ();
%probabilities = ();
foreach $key (keys(%$hash)){
$key =~ /\-\>/;
my $category = $` ;
if(exists $totals{$category}){
$totals{$category}+=$$hash{$key};
}
else {
$totals{$category} = $$hash{$key};
}
}
foreach $key (keys(%$hash)){
$key =~ /\-\>/;
my $category = $`;
$prob = $$hash{$key} / $totals{$category};
$probabilities{$key} = $prob;
}
}