ACL3 Project

Probabilistic Context-Free Parsing

 

 

 

 

 

 

 

Aoife Cahill: 97093246

 

Supervisors:   Josef van Genabith

Fintan Costello


Probabilistic Context-Free Parsing

1.      Introduction

In the space of the last ten years, statistical methods have gone from being virtually unknown in computational linguistics to being a fundamental given. [1] The use of statistical methods has greatly furthered progress in many areas of language processing. Disambiguation, database classification of documents, speech recognition and grammar learning are just some of the areas that statistics has helped. We now have several large electronically available corpora, from which we can extract the statistical information that can help us understand language. These corpora are usually annotated and we can use these annotations to our advantage. Due to the sheer volume of text and data in these corpora, we can assume certain regularities of a language and try to describe the features of a language by observing frequent characteristics. Statistics is the ideal tool for analysing these frequently occurring phenomena.

            One reason statistics holds such high esteem among computational linguists is because the language being analysed has actually been written or spoken by natives of the language, and does not rely on fabricated examples produced by linguists. It is "real data" as spoken and written by natives. This would suggest that by using statistics that we get a more accurate view of a language. Up until now there have been problems with either over-generation or under-generation. A more robust method of language processing was needed. Statistics offers this robustness, since it allows for a best guess in case of uncertainty. [3] Of course, as with most theories, there are disadvantages to using statistics. Sparse data is probably the main problem that computational linguists face when trying to use statistics to describe language. The results we achieve from analysing corpora depend heavily on the type of data found in the training corpus and the following test corpus. There is no perfect data collection, nor any perfect test suite. [4] Therefore when we use statistics, we will always get only an approximation of the language.

I intend to show how statistics are used in studying language, in particular probabilistic context free grammars, and how these grammars can be used to produce statistical parsers. I would also like to show the advantages and disadvantages of using statistics in parsing, and to explore some of the methods to optimise parsing methods by using statistics. I will extract a probabilistic context-free grammar from a parsed version of the Penn Treebank using Perl. I will then discuss possible methods of using this grammar in a parser.

 

 


2.      Hidden Markov Models

Hidden Markov Models were one of the first implementations of statistical methods for linguistics. The most common use of HMMs today is in speech processing. The two most common forms of calculating probabilities of sentences are the bigram and trigram models. A Bigram model is one in which it is assumed that the previous word affects the probability of the next word. A Trigram model assumes that the previous two words affect the probability of the next word. The more general case is of an n-gram model, in which it is assumed that the previous n-1 words have an effect on the probability of the next word. The reason bigram and trigram methods are used is because to calculate the probability of the string as a whole, usually works out to be unreasonable due to sparse data.

0.6

 
Markov processes were developed by Andrei A. Markov, with the purpose of modeling the letter sequence in works of Russian literature (Markov 1913). [9] A Markov Chain is a special case of a finite state automaton, where there are probabilities associated with each arc. The numbers associated with all of the arcs leaving a state must sum to one. A FSA can be thought of as either an acceptor or a generator, and so can a Markov Chain. [4] Here is a simple example of what a Markov Chain might look like.

0.4

 

0.3

 

h

 

p

 
                                                     

0.3

 

0.6

 

1.0

 

 a

 
                                           

1.0

 

 t

 

i

 

0.4

 
                                                                 

 e

 

0.4

 

                                                                           

 

Start                                                                 Example taken from [9]

 

To calculate the probability of a given string, for example P(t,i,p), just multiply the probabilities on the arcs.

 

So, the probability of the sequence (t,i,p) according to the above model, would be:

P(t,i,p)        = P(X1=t)P(X2=i|X1=t)P(X3=p|X2=i)

                  =1.0 x 0.3 x 0.6

= 0.18

Formally a HMM is defined as a four-tuple <s1, S, W, E> where S is a set of states, s1ÎS is the initial state of the model, W is a set of output symbols and E is a set of edges or transitions. These transitions are in turn four-tuples, <si, sj, wk, p>, where si is the state the transition starts from, sj is the state the transition goes to, wk is an output symbol and p is the probability of taking the transition. [4] Manning and Schütze [9] describe a HMM as a five-tuple (S, K, P, A, B), where S and K are the set of states and the output alphabet, and P, A and B are the probabilities for the initial state, state transitions, and symbol emissions respectively. Markov models assume that the only information affecting the probability of an output, or of the next state, is the prior state. This is called the Markov assumption. A Hidden Markov Model is so called because from the input, we cannot deduce the state sequence followed by a HMM, this sequence is hidden.

 

3.      Using HMMs

Two of the most common uses of HMMs are estimating the probability of a string and finding the most likely path through a HMM given an input. The second situation is often used to estimate Part-of-Speech tags.


Sparse data often causes a problem for estimating the probability of an English sentence, therefore a “smoothed” trigram model is preferred. A smoothed model is needed to deal with zero probabilities for less common trigrams. The most efficient method of smoothing is to add weighted terms for unigrams and bigrams into the probability assignment, giving the following equation:

[4]                                                                                                         l1,l2 and l3 must sum to 1

 

The Viterbi Algorithm

The Viterbi Algorithm is used for finding the most probable path through a HMM given a particular output sequence. It can be efficiently implemented using a trellis. [6] A trellis, sometimes called a lattice, is a square array. In the case of the Viterbi Algorithm, this trellis contains a square array of state versus time, and the probabilities for being in each state at each time in terms of the probabilities for being in each state at the preceding time instant [9]

Text Box: Given word sequence w1…wT, lexical categories L1,...Ln, lexical probabilities P (wt|Li), and bigram probabilities P(Li|Lj), find the most likely sequence of lexical categories C1,...CT for the word sequence.
Initialisation Step:
For i = 1 to N do
	SEQSCORE(i,1) = P(w1|Li)* P(Li|Æ)
	BACKPTR(i,1) = 0

Iteration Step:
For t=2 to T
	For i= 1 to N
		SEQSCORE(i,t)= MAXj=1,N (SEQSCORE(j,t-1)* P(Li|Lj)) * P(wt|Li)
		BACKPTR(i,t) = index of j that gave the max above

Sequence Identification Step:
C(T) = i that maximises SEQSCORE(i,t)
For i = T-1 to 1 do
	C(i) = BACKPTR(C(i+1),i+1)
[2]

 


Example calculation using the Viterbi Algorithm taken from Chapter 7 of “Natural Language Understanding” [2]

We have a table of words, used to make the ambiguous string “Flies like a flower”. The figures are taken from an artificially generated corpus containing 300 sentences and which only has words in four categories: N, V, ART, and P. [2] The table also contains probabilities P (Word|Category) which represents the probability that Word is of type Category in the corpus.


 


Word

N

PROB(Word|N)

V

PROB(

Word|V)

ART

PROB(

Word|ART)

P

PROB(Word|P)

TOTAL

Flies

21

0.025

23

0.076

0

0

0

0

44

like

10

0.012

30

0.1

0

0

21

0.068

55

a

1

0.001

0

0

201

0.36

0

0

202

flower

53

0.064

15

0.05

0

0

0

0

68

others

748

232

357

286

1629

total

833

300

558

307

1998

 

We also need a table of bigram probabilities from the corpus. The bigrams indicate the estimated probability that  the first category will occur, given that the word is preceded by a word with the second category. Any entry not found in the table is assumed to be 0.0001 [2]

 

Bigram

Estimate

P(ART|Æ)

0.71

P(N|Æ)

0.29

P(N|ART)

1

P(V|N)

0.43

P(N|N)

0.13

P(P|N)

0.44

P(N|V)

0.35

P(ART|V)

0.65

P(ART|P)

0.74

P(N|P)

0.26

 

The algorithm calculates the probabilities of the best sequence leading to each category at each position using an N*T array, where N is the number of lexical categories and T is the number of words in the string. In this example, this is a 4x4 array (trellis). The first step in the algorithm is to initialise the trellis by calculating the probability P(Cati |Æ) * P(W1|Cati) for each category. The following steps extend the trellis one word at a time, keeping track of the best sequence at each stage. Once all the words have been processed, the path that led to the best probability can be read from the BACKPTR array. [2]

 

The trellis after the first step of the Viterbi Algorithm looks something like:

7.6*10-6      (0.0001*0.076)

 

0.00725    (0.025*0.29)

 

0

 

0

 
 

 

 

 

 

 

 


Step 2:

0.00031      MAX(7.6*10-6*0.0001, 0.00725*0.43) *0.1

 

1.13*10-5        MAX(7.6*10-6 * 0.35,  0.00725*0.13) *0.012

 

0.00022        MAX(7.6*10-6  0.0001, 0.00725*0.44)*0.068

 

0

 
 

 

 

 

 

 

 

 


a/V

 

like/V

 

0

 
Step 3:

7.2*10-5                   MAX(0.00031*0.65, 0.00022*0.74)*0.36

 

2.6*10-9

 
Step 4:

4.3*10-6

 

0

 

 

0

 

 
 

 

 

 

 

 

 

 

 

 


            The Viterbi Algorithm correctly gives the highest probability to the sequence Flies (N) like (V) a (ART) flower (N).  But this algorithm will not always necessarily find the most intuitive sequence. By using a trigram model instead of the bigram model in this example, accuracy can be greatly improved. This algorithm is most successful, when the statistics gathered are gathered from a large corpus, which is similar to the input to be assessed. This is a frequent problem when analysing natural language with statistics. When a method is developed it has to be tested and trained on a particular corpus. The style and content of this corpus greatly influences how the new method will perform with unseen data. Natural language in its very nature is diverse in style and content. It is important to get a balance and a broad range of language when training a model if you want the model to perform well on unseen data.

 

Forward Probability, Backward Probability and the Forward-Backward Algorithm

            If ai(t) is the probability of producing w1,t-1 while ending up in state si, it is called the forward probability. It is so called because it calculates the probability by moving forward in the sequence. Backward probability calculates the probability by moving backwards in the sequence. [4] By combining these two procedures we get the forward-backward algorithm. This is also known as the Baum-Welch Algorithm, and is an iterative hill-climbing algorithm for finding the local maximum of a model. It is given a training sequence and adjusts the HMM parameter probabilities (the probabilities on the arcs) to make this training sequence as likely as possible.

 

4.      Parsing Algorithms

A parser uses a parsing algorithm, along with a grammar and dictionary, to produce a phrase structure tree that corresponds to a given sequence of words. There is a direct correspondence between the rules of a grammar and the structure it assigns. [13] The goal of parsing is to describe the most efficient procedures and algorithms to determine whether a given sentence is a valid sentence according to a given grammar, and if so, what its structure is. The syntax of natural languages is best represented in a hierarchical, recursive structure, or tree. This is non-linear, and therefore HMMs are not suited to finding the structure of a sentence.

There have been numerous parsing algorithms developed, and still more are being developed. A simple top-down parser will parse a sentence from top to bottom. This algorithm tries to predict the end string from the given grammar. This can be non-deterministic, as each time it chooses the wrong path, it has to backtrack to where it made the wrong decision. One problem associated with top-down depth-first left to right parsers is that they sometimes get stuck in a loop, if one or more of the grammar rules are left recursive. Another is that it is inefficient and has a worst case exponential run-time. Bottom-up parsers start with the input string and try to reduce the string into its grammar rules. A disadvantage of this type of parsing is that if the grammar has empty productions (e) the parser can also get stuck in a loop.

An active chart parser can take the advantages of both types of parsing algorithms, and combine them to make an efficient parser.  These parsers have a type of book keeping mechanism called charts, which store partial analyses preventing the need to backtrack to the beginning when something goes wrong. A chart can be visualised as a network of vertices representing points in the sentence, linked by edges representing constituents. [13] A Chart Parser has three main constituents, a key list, a chart and a set of edges. A chart is a set of chart entries, each of which consists of the name of a terminal or non-terminal symbol, the starting point of entry and the entry length. The key list is a pushdown stack of chart entries that are waiting to be added into the chart. The edges are rules that can be applied to chart entries to build them up into larger entries. [4]

Text Box: The CKY Algorithm

Parse(x1…xn)
	Recognise(x1…xn)
	if S Î t0,n then
		pik := {} for all 0 £ i £ k £ n
		return Gen(S,0,n)
	else
		return fail
Gen(A,i,k)

if $n Î pik with n = <A,l> then
	return n
else if i=k-1 then
	n:= xk
	l := {n}
else
	l := {}
	for all A à B C Î P do

The CKY Algorithm

 

 

Text Box: 		for j := i+1 to k-1 do
			if B Î tij and C Î tjk then
				nl := Gen(B,i,j)
				nr := Gen(C,j,k)
				l := l  È{<nl,nr>}
n := <A,l >
pik := pik È {n}
return n
Recognise (x1...xn)

for k := 1 to n do
	tk-1,k := {A|A à xk Î P}
	for i := k – 2 downto 0 do
		tik := {}
		for j := i+1 to k-1 do
			for all A à B C Î P do
				if B Î tij and C Î tjk then
					tik := tik {A}
if S Î t0n then accept, otherwise reject

[11]

 


Cocke, Kasami and Younger developed a chart based bottom-up parser for grammars in Chomsky-Normal form. Chomsky-Normal form is a restricted type of CFG where all of the rewrite rules must be of the form: Aà a or AàB C, where {A,B,C} are non-terminals and a is a terminal symbol. The chart is filled from left to right and from bottom to top.

There are two phases to this algorithm. The first stage is the recognition phase which constructs a table denoting the non-terminals that derive certain substrings of the language. This ultimately allows us to see whether the input string can be derived. The second stage uses this table and the grammar to construct all possible derivations of the string. [ The algorithm first tries to derive strings of the shortest length and then works its way up through the chart. [5]

 

For example, given the following grammar (in CNF), to parse the string "nvdnpdn":

S à NP VP

VP à V NP

N1 à a

P à p

NP à NP PP

VP2 à V NP

NP à a

PP à p

NP à D N1

S à V NP

A à a

NP à p

NP à A N1

VP à VP PP

D à d

V à v

N1 à A N1

S à VP PP

NP à d

VP à v

NP à P NP

VP à VP2 NP

N1 à n

S à v

PP à P NP

S à VP2 NP

NP à n

 

 

The first step is to fill the first row of the chart with possible non-terminals which start with the terminal symbol in the input string. The next step is a recursive step. For each in the nth row there are n-1 combinations of pairs of non-terminals that can be used to try and add a non-terminal to the cell. For example in the second row, the first cell has only one pair of cells to check to see if by taking a non-terminal from each a new non-terminal can be formed. In row 4, the first cell ([4,4]) has three possible pairs to choose from. These pairs are: ([1,1],[3,4]), ([2,2],[2,4]) and ([3,3],[1,4]). All the possible pair combinations of non-terminals which form new non-terminals are added to this cell. The process continues until all the cells have been filled. The input string has been successfully parsed if the start symbol, in this case S, appears in the last cell. [11]

 

1

n

2

v

3

d

4

n

5

p

6

d

7

n

1

N1,NP

V,VP,S

D, NP

N1, NP

P, PP,NP

D, NP

N1,NP

2

 

S

S,VP,VP2

NP

NP

NP,PP

NP

3

 

 

S

S,VP,VP2

NP

NP

PP,NP

4

 

 

 

S

VP,VP2,S

NP

NP

5

 

 

 

 

S

VP,VP2,S

NP

6

 

 

 

 

 

S

VP,VP2,S

7

 

 

 

 

 

 

S

 

This algorithm is not so suitable for parsing because it demands that the grammar be in CNF, which is rarely the case in practice, although it has been proved that all grammars can be converted into CNF. The main problem with this however is that this often leads to an exponentially large number of rules. If this algorithm is extended to deal with normal CFGs it becomes much more inefficient and the runtime increases exponentially. [11]

The Earley Algorithm

An Earley parser is essentially a generator that builds left-most derivations of string, using a given set of context-free productions.  The parser keeps a set of states for each position in the input, describing all pending derivations. These state sets form the Earley chart. A state is of the form i: kXà l·m where X is a non-terminal, l and m are strings of non-terminals and/or terminals and i and k are indices  into the input string. The current position is i, X was expanded starting at position k and the expansion of X proceeded using the production X à lm, and has expanded the right hand side up to the position indicated by the dot. A state with a dot to the right hand side of the entire right hand side is called a complete state. There are three types of operations that an Earley parser will carry out, prediction, scanning and completion. A prediction corresponds to a potential expansion of a non-terminal in a left-most derivation. Scanning moves the dot over the current symbol, if it finds a match in the current input. Completion corresponds to the end of a non-terminal expansion started by a prediction step. [12]

 

5.      Probabilistic Context-Free Parsing

Context-free grammars are widely used as models of natural language syntax. In their probabilistic version, which defines a language as a probability over strings, they have been used in a variety of applications. [12]

The simplest way to gather statistical information about a grammar is to count the number of times each rule is used in a corpus containing parsed sentences and use this to estimate the probability of each rule being used. [2] The simplest probabilistic model for recursive embedding is a Probabilistic (or Stochastic) Context-Free Grammar. This is basically a CFG with probabilities associated with each rule, indicating how probable a rewrite rule is. The algorithms used for PCFGs are a natural development from the algorithms employed with HMMs. [9]

Formally Charniak [4] describes a probabilistic context-free grammar as a four-tuple <W, N, N1, R> where W is a set of terminal symbols {w1,…ww}, N is a set of non-terminal symbols {N1,…Nn}, N1 is the starting symbol, and R is a set of rules, each of which is of the form Ni à zj where zj is a string of terminals and non-terminals. Each rule has the probability P (Ni à zj)

A non-terminal dominates constituents if in a derivation, all of the constituents are derived (directly or indirectly) from the non-terminal.

dominates

 
 

 

 


To find the probability of a tree in a PCFG model, just multiply the probabilities of the rules that built its local sub-trees. [9] The probability of a string is defined as the sum of the probabilities of the parse trees that have this string as a yield. The probability of a parse tree is defined as the probability of the corresponding leftmost derivation. [6] Since the number of parses grows exponentially with the length of a sentence, an algorithm similar to the Viterbi algorithm can be used to find the most probable parse for a sentence.

There are many advantages to using PCFGs over normal CFGs to determine the structure of a sentence. As grammars get larger and larger, they become more and more ambiguous. A PCFG will give an indication of the most likely parse. It can also be programmed to give the n most probable structures, which can be useful when sentences are ambiguous to the reader. Often CFGs return structures, which intuitively a human would never have chosen. For example, Martin et al. (1987) report their system giving 455 parses for the sentence: “List the sales of the products produced in 1973 with the products produced in 1972”. [3,8] While it is obvious that there is a certain degree of ambiguity due to the PP attachments, a human would more than likely not have found many of the computer-generated parses, simply because we can rule out certain combinations of words purely on the basis of our common sense. This indicates how a probabilistic grammar would be useful in simply suggesting the more probable parses, instead of a human having to go through each of the 455 parses to determine whether they were reasonable or not.

PCFGs do not, however, take context or lexical co-occurrence into consideration. This means that sometimes some of the parses suggested by a PCFG may not be the best, because it will have based its plausibility purely on syntactic structures.  Charniak [4] illustrates this with an example of PP attachment. PPs, statistically, attach slightly more often to NPs than to VPs. However in the following two sentences, it is more the meaning of the words than the structure of sentence, that causes the PPs to be attached differently.

·         Alice bought a plant with Mary.

·         Alice bought a plant with yellow leaves.

      PCFGs tend to be robust [9]. They produce a model of a language based on real data, and therefore do not have to worry about things like grammatical mistakes, which occur in real-life situations. Although PCFGs have many advantages, the main, and I believe critical, disadvantage is that context is not taken into account at all. In fact a trigram model of a language would probably achieve better results [4], even though it takes no account of internal structures in the language. Manning and Schütze [9] also point out that even structurally a PCFG is also deficient. A PCFG takes no account of where a phrase appears in a sentence. The expanding of a subject NP is often different to the expanding of an object NP. Many people have noted this disadvantage and many attempts have been made to combine other theories and PCFGs to achieve better results.

The Ƥearl and Ƥicky Projects

            Ƥearl is a probabilistic Earley-style parser. It is a bottom-up chart parser with Earley-type top-down prediction which pursues the highest-scoring theory in the chart, where the score of a theory represents the extent to which the context of the sentence predicts that interpretation [7] Ƥearl uses Church’s part-of-speech assignment trigram model, a simple probabilistic unknown word model, and a conditional probability model for grammar rules based on part-of-speech trigrams and parent rules. Ƥearl avoids the problem of not having any context information, by using a context sensitive conditional probability CFG, where context is partly influenced by the part-of-speech sequences. Ƥearl also uses the geometric mean to score theories, rather than just multiply probabilities. Magerman and Mitchell [7] claim that by simply multiplying probabilities, you are over-penalising infrequent constructs. Ƥearl also has functions to deal with unknown words and idioms. If it cannot successfully parse a sentence, it will attempt to create a partial parse of the sentence.

            This project is one example of how a statistical grammar can be combined with more successful n-gram methods to parse. It parsed with an accuracy of 88% (on a test suite of 40 sentences) which is considerably better than what a parser using just a PCFG can do. Its successor Ƥicky also had similar results.

Ƥicky is a probabilistic agenda-based chart parsing algorithm which uses a technique called probabilistic prediction to predict which grammar rules are likely to lead to an acceptable parse of the input. [8] The parser limits the edges that can be added to the chart, to only those predicted by its probabilistic prediction technique.

Ƥicky uses the CKY chart-parsing algorithm. This algorithm parses efficiently and robustly in polynomial time. Although the Earley algorithm is more efficient than the CKY algorithm, it is not as robust when parsing ungrammatical sentences.

 

6.      Conclusion

Statistical methods for natural language processing are becoming more and more widely accepted in the computational linguistics field, because of their practicality for computation. However Chomsky, the most famous and well-respected linguist, has argued against the use of statistics in linguistics.  He argued that probability theory is inappropriate for formalising the notion of grammaticality, since computing the probability of sentences from a corpus, assigns the same low probability to all unattested sentences, grammatical and ungrammatical ones alike. (Chomsky 1957)[9] However, I believe that statistical methods used in NLP today are appropriate. Methods developed always try to find a suitable probability for unseen events, this is indeed one of the challenges of developing a successful statistical method for analysing natural language. Abney [1] argues that many of Chomsky’s arguments against statistical models, HMMs in particular, were against FSA, rather than the fact that they were stochastic. I believe that statistics can provide a good insight to actual occurrences in natural language. Due to the fact that the statistics we use are taken from corpora of language spoken or written by native speakers, I believe that this information can often be more useful than a theoretical attempt to represent for example the grammar rules of a language.  By adding probabilities to such rules, we also get a better idea of which rules are more common, and which rules may be rare. One often finds quite rare and sometimes extreme examples when linguists are describing some phenomena of a language. While these examples might be interesting, I think they do not give a true picture of a language, since they give no indication of how rare or typical such an utterance might be. This is a difference between linguistic competence and linguistic performance. I believe that when developing an NLP system, that the emphasis should be on linguistic performance, since human speakers of a language will react quicker to utterances they are likely to say or have heard themselves.

In my opinion PCFGs give a much better overall view of a language.  However, on their own, they cannot predict the structure of a sentence accurately enough. This is simply because, as their name suggests, they do not take context into consideration. This implies that they have to be combined with other methods of analysis to be effective. Charniak (1996) investigates how vital lexical information is to parsing. He induced a maximum likelihood PCFG from the Penn Treebank, made some small modifications and used this to parse some unseen sentences. His results were surprisingly good. Manning and Schütze [9] argue that this is maybe because many parsing decisions are mundane after all and can be handled by an unlexicalised PCFG.

I believe that the efficiency and accuracy can be improved by combining a simple probabilistic parser with other methods, similar to those outlined in the Ƥearl and Ƥicky Projects. I have written a short Perl program to extract a PCFG from the Penn Treebank. This program outputs a list of grammar rules found, a list of lexical items found, and a list of the probabilities associated with each rule found. Many of the rules are quite flat, and do not contain all the parts of speech. For my third year implementation, I attempted to assign parts of speech to some of the words by assuming that an untagged word in a phrase was the head of the phrase if there had been no head found already, but this is not always correct. Next year I intend to extract the correct part of speech tags for all untagged words in the parsed version, from the tagged version. This will increase the accuracy of the rules produced by the program. I intend to use this PCFG to write a probabilistic parser. I will first write a simple probabilistic parser using only this PCFG, and then try to develop a more complex algorithm taking more lexical information into account, and hopefully show that by using the lexical information the accuracy and efficiency of the parser is increased.

A sample output of the program is:

Input

( (S (NP-SBJ (NP Pierre Vinken)

             ,

             (ADJP (NP 61 years)

                   old)

             ,)

     (VP will

         (VP join

             (NP the board)

             (PP-CLR as

                     (NP a nonexecutive director))

             (NP-TMP Nov. 29)))

     .))

 

Output

 

Lexicon
Rules
Probabilistic Rules

ADJ->old: 1

ADJP->NP ADJ: 1

ADJP->NP ADJ: 1

NP->61 years: 1

NP-SBJ->NP ADJP : 1

NP-SBJ->NP ADJP : 1

NP->Pierre Vinken: 1

PP-CLR->P NP : 1

PP-CLR->P NP: 1

NP->a nonexecutive director: 1

S->NP-SBJ VP : 1

S->NP-SBJ VP: 1

NP->the board: 1

VP->V NP PP-CLR NP-TMP : 1

VP->V NP PP-CLR NP-TMP: 0.5

NP-TMP->Nov. 29: 1

VP->V VP : 1

VP->V VP : 0.5

P->as: 1

 

 

V->join: 1

 

 

V->will: 1

 

 

 

 

Bibliography

[1] Abney, Steven (1996) Statistical Methods and Linguistics: In Judith Klavans and Philip Resnik, eds., The Balancing Act. MIT Press, Cambridge, MA.

 

[2] Allen, James (1995) Natural Language Understanding Benjamin/Cummings Publishing Company Inc.

 

[3] Bod, Rens (1995) Enriching Linguistics with Statistics ILLC Dissertation Series 1995-14

 

[4] Charniak, Eugene (1993) Statistical Language Learning:  MIT Press, Cambridge, MA

 

[5] Grune, Dick and Jacobs, Ceriel J.H (1990) Parsing Techniques a Practical Guide Ellis Horwood Limited

 

[6] Krenn, Brigitte and Samuelssonm Christer (1997) The Linguist’s Guide to Statistics http://www.coli.uni-sb.de/~krenn/edu.html Last visited: 23rd  October 2000

 

[7] Magerman, David M. and Marcus, Mitchell P.  (1991)Ƥearl: A Probabilistic Chart Parser in Proceedings of the European ACL Conference, March 1991. Berlin, Germany

 

[8] Magerman, David M. and Weir, Carl (1992) Efficiency, Robustness and Accuracy in Ƥicky Chart Parsing in Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics ,June 1992. Delaware, USA

 

[9] Manning, Christopher D and Schütze, Hinrich (1999) Foundations of Statistical Natural Language Processing MIT Press, Cambridge, MA

 

[10] Martin W., Church,K and Patil, R (1987) Preliminary Analysis of a Breadth First Parsing Algorithm: Theoretical and Experimental Results” in L. Bolc (ed) Natural Language Parsing Systems, Springex Verlag, Berlin

 

[11] Schmid, Helmut (1999) taken from lecture notes of the course Parsing I which took place in the winter semester 1999/2000 at Universität Stuttgart, Germany

 

[12] Stolcke, Andreas (1993) An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities Berkley, CA (revised 1994)

 

[13] Winograd, Terry (1983) Language as a Cognitive Process Volume 1: Syntax Adison-Wesley Publishing Company

 

 

#!/usr/bin/perl

 

###############################################################

#                                                             #

# A Project to extract a Probabilistic Context Free Grammar   #

# from the Penn Treebank                                      #

# written as part of a third year project for DCU             #

# Aoife Cahill                                                #

# August 2000                                                 #

#                                                             #

# usage: pcfg.pl directory_name                               #                                                            #        where directory_name is the name of the directory    #

#        in which the parsed files are kept                   #

#                                                             #

###############################################################

 

#read in the directory in which all the parsed files are kept

 

$suffix = ".prd";

%lexicon =();

%rules = ();

 

#read in the directory in which all the parsed files are kept

 

$directory = shift;

@files = get_files($directory);

 

foreach $file(@files){

      parse ($file);

}

 

open(OUTLEX, ">lexicon") or die "Error: $!";

open(OUTRULE, ">rules") or die "Error: $!";

open(OUTPROB, ">probs") or die "Error: $!";

 

foreach $key (sort keys(%lexicon)){

      print OUTLEX "$key: $lexicon{$key}\n";

}

 

foreach $key(sort keys(%rules)){

      print OUTRULE "$key: $rules{$key}\n";

}

 

get_probs("rules");

 

foreach $key(sort keys(%probabilities)){

      print OUTPROB "$key: $probabilities{$key}\n";

}

 

close(OUTLEX) or die "Error : $!";

close(OUTRULE) or die "Error : $!";

close(OUTPROB) or die "Error : $!";

 

#####################################

# Sub-routine get_files($)          #

#                                   #

# gets a list of files that provide #

# the data for the pcfg             #

#####################################

 

sub get_files($){

 

      my $dir = shift;

      my @listing = glob "$dir/*";

 

      if (@listing > 0){

            foreach $file(@listing){

                  if(-f $file){

                        #only add .prd files to list

                        if ($file =~ /\Q${suffix}\E$/){

                              push @parsed_files, $file;

                        }

                  }

                  elsif (-d $file){

                        get_files($file);

                  }

                  else {

                        #not interested in directory element

                  }

            }

      }

      return @parsed_files;

}

 

############################################

# Sub-routine parse($)                     #

#                                          #

# Extracts grammar rules and their         #

# frequencies from each data file          #

############################################

 

sub parse($){

 

      my @string = ();

      $file = shift;

      print "processing file : $file\n";

 

      open(FILE,$file) or die "Error : $!";

     

      while (<FILE>){

            chomp;

            if ($_ =~ /^\s/){

                  s/^\s+//;

                  my $str = clean($_);

                  push @string, $str;

            }

            elsif (@string > 0) {

                  process_string(@string);

                  @string = ();

                  my $str = clean($_);

                  push @string, $str;

            }

            else {

                  my $str = clean($_);

                  push @string, $str;

            }

      }

      #process the last sentence

      process_string(@string);

}

 

###################################

# Sub-routine process_string($)   #

#                                 #

# puts the sentence into one      #

# string variable                 #

###################################

 

sub process_string($){

      my @string = @_;

      my $sentence = "";

      for($i=0;$i<@string;$i++){

            $sentence = $sentence.$string[$i];

      }

      $sentence = substr($sentence,1,-1);       #get rid of extra ()

      parse_string($sentence);

 

}

 

###################################

#  Sub-Routine parse_string($)    #

#                                 #

# gets the grammar rules from one #

# sentence and increments freq.   #

###################################

 

sub parse_string($){

      my $string = shift;

      my @categories = @_;

      my $num = @categories;

 

      if ($string =~/^[0\@\*]/){

            #this is a form of null element, ignore it

            #starts with a 0 or @

            pop @categories;

            my $rest = $';

            parse_string($rest,@categories);

      }

      elsif ($string =~ /^\(([A-Z]+([-|\|][A-Z]+)*)([-=]?[0-9]+)? ?/){

           

            # new category... add category to previous rule

            # and start new rule for this category...

 

            #CAT(

           

            my $category = $1;

            if ($num > 0){

                  $categories[$num-1] = "$categories[$num-1]"."$category"." ";           

            }

            my $new_rule = "$category->";

            push @categories, $new_rule;

            parse_string($',@categories);

      }

      elsif ($string =~ /^([A-Za-z0-9\$\£\%\#]+[A-Za-z0-9\- \.\&\%\$\/\#\\]*)\)/){

 

            #word)

            my $value = $1;

            my $rest = $';

            my $category = $categories [$num-1];

        if ($category =~ /([A-Z\-]+)P\-\>[A-Z\- ]+$/){

 

        #not all words are given categories.... category assumed

        #to be that of mother

        #e.g. (ADJP(NP 61 years)old)

        #old assumed to be an ADJ

            my $type = $1;

            my $rule = "$type->$value";

            add_to("lexicon",$rule);

            $category = $categories[$num-1].$type ;

            pop @categories;

            add_to("rules",$category);

        }

        else{

                  if ($num >0){

                              $categories[$num-1] = $categories[$num-1].$value;

                              my $rule = pop @categories;

                              add_to("lexicon",$rule);

                  }

            }    

            parse_string($rest,@categories);

      }

      elsif ($string =~ /^([A-Za-z0-9\$\#]+[A-Za-z0-9\-\.\&\%\#\$\/\\\s]*)([A-Za-z0-9 ]+)?\(/){

            #if a word appears with no tag, it is assumed to have the

            #type of the main category.

            #eg. (VP will(VP join

            #will and join are taken to be verbs

 

            #word(

 

        my $value = $1;

        if ($num > 0){

            my $category = $categories[$num-1];

            if ($category =~ /([A-Z]+)P(-[A-Z]+)?->/){

                        my $type = $1;

                        my $rhs = $';

                        if ($rhs =~ /\b$type[P]?/){

                              $categories [$num-1] = $categories[$num-1].$value." ";

                        }

                        else{

                              my $rule = "$type->$value";

                              add_to("lexicon",$rule);

 

                              $categories[$num-1] = "$categories[$num-1]"."$type"." ";

                        }

                  }

                  else {

                        $categories[$num-1] = $categories[$num-1].$value." ";

                  }    

        }

        my $rest = "($'";

        parse_string($rest,@categories);

      }    

      elsif ($string =~ /^\)/){

 

            if ($num > 0){

                  my $rule = pop @categories;

                  #don't add empty productions

                  if ($rule){

                        if ($rule =~ /->[A-Za-z\$\#0-9]/){

                              add_to("rules",$rule);

                        }

                  }

            }

            parse_string($',@categories);

      }

      elsif ($string eq ""){

      #     print "end of sentence\n";

      #     print "@categories\n";

      }

      elsif ($string=~ /^[\.\s\-\&=]/){

            parse_string($',@categories);

      }

      elsif ($string =~ /^\( \(/){

            my $rest = "($'";

            parse_string($rest,@categories);

      }

      else {

            print "$string";

            print "oops.. something i haven't thought of\n";

      }

 

}

 

##################################

# Sub-Routine add_to($,$)        #

#                                #

# adds a rule to the lexicon     #

# or list of rules or            #

# increments its frequency count #

##################################

 

sub add_to ($,$){

     

      my $hash = shift;

      my $rule = shift;

 

      $rule =~ s/\s+/ /g;

 

      if (exists $$hash{$rule}){

            $$hash{$rule}++;

      }

      else{

            $$hash{$rule} = 1;

      }

}

 

#####################################

# Sub-Routine clean($)              #

#                                   #

# takes out all punctuation and     #

# null elements                     #

#####################################

sub clean($){

      my $string = shift;

     

      $string =~ s/[\'\`\?\!\"\;\:\,]//g;

      $String =~ s/\s+/ /g;

      $string =~ s/-[A-Z]+-//g;

      $string =~ s/\*[A-Z]+\*(-[0-9]+)?//g;

      $string =~ s/\*-[0-9]+//g;

      $string =~ s/[A-Z]+\*//g;

 

      return $string;

}

 

#####################################

# Sub-Routine get_probs($)          #

#                                   #

# assigns probabilities to each     #

# rule that was found               #

#####################################

 

sub get_probs($){

 

      my $hash = shift;

      %totals = ();

      %probabilities = ();

 

      foreach $key (keys(%$hash)){

            $key =~ /\-\>/;

            my $category = $` ;

            if(exists $totals{$category}){

                  $totals{$category}+=$$hash{$key};

            }

            else {

                  $totals{$category} = $$hash{$key};

            }

      }

 

      foreach $key (keys(%$hash)){

            $key =~ /\-\>/;

            my $category = $`;

            $prob = $$hash{$key} / $totals{$category};

            $probabilities{$key} = $prob;

      }

}