ACL3 Project 2000/1

 

Probabilistic Parsing Technology and Sentence Disambiguation

 

By

 

John Dowling, ACL3 : 98441337

 

Supervisors: Dr. Josef Van Genabith

                                                                                                                                    Dr. Andy way

 

 

 

Introduction

The purpose of this project is to initially make an overview of statistical approaches to NLP. This objective is achieved by firstly outlining the probability theorem used in statistical NLP and then applying this theorem to ‘Part-of-Speech Tagging’ (POS tagging) and probabilistic parsers. I have chosen to then concentrate on the area of Probabilistic Context-Free Grammars and to investigate at a deep level how these PCFGs can be used as parsers and to examine how efficiently they work in the area of sentential disambiguation by implementing a PCFG as a parser in Prolog. By acquainting myself strongly with these topics and providing myself with concrete information on these areas, I believe that this research will provide a strong basis for a 4th year implementation, regarding Context-Dependent Best-First parsers. The following contents should help provide a better overview of the structure of my project:

Contents

1. Relevant probability Theory:

·        1.1 An Introduction to Probability Theory

·        1.2 Conditional Probability

·        1.3 Bayes’ Rule

·        1.4 Statistical Independence

2. Part-of-Speech Tagging:

·        2.1 Overview

·        2.2 N-gram Models

·        2.3 Hidden Markov Models (HMMs)

·        2.4 The Viterbi Algorithm

3. Probabilistic Context-Free Grammars

·        3.1 Overview

·        3.2 Gathering Statistics for a PCFG

·        3.3 Computing Parse Probabilities

·        3.4 Mini Implementation and Testing

·        3.5 Further Development

·        3.6 Important Features of PCFGs

4. Conclusion

5. Bibliography

·        5.1 Books

·        5.2 Online References

 

 

1. Relevant probability theory

 

1.1 An Introduction to Probability Theory

 

Let us first explain some of the terms used with regard to statistical NLP:

 

 

 

         => Ω = {Head, Tail}.

 

 

Intuitively, the probability of some event is the likelihood that the event will occur. A probability of 1 indicates that the event is certain to occur and a probability of 0 indicates the event will definitely not occur. Therefore a probability of 0.5 would indicate that an event is equally likely to occur as not to occur. Probability can be defined more formally in terms of a random variable. A random variable may range over a predefined set or values or over infinite sets and continuous values. Here we will use random variables which range over a finite set of values. Consider the throwing of a die. The random variable THROW representing the result of throwing a die ranges over the six possible values {1,2,3,4,5,6}. No other value for THROW is available. Therefore we will talk of the probability of THROW = 1 simply as the probability of the event 1.

A probability function PROB, assigns a probability to every possible value of a random variable. Every probability function must have the following properties, where e1,…,en are the possible distinct values of a random variable E:

 

1.     PROB(ei)≥0, for all i.

2.    PROB(ei)≤1, for all i.

3.    i=1, n PROB(ei)=1.

 

 

 

 

 

1.1 Conditional probability

 

A very important mathematical notion involved in probabilistic parsing and ambiguity resolution, is that of Conditional Probability. It is defined by the formula:

 

PROB(e|e′) = PROB(e & e′) / PROB(e′)

 

Where PROB(e & e′) is the probability of the two events occurring simultaneously.

 

The following example explains the formulas’ usage more clearly:

Given weather readings for 100 days of the year, where, on a particular day it either rained or not, a weather reading is represented as a random variable R that has one of two values, Rain or Dry. Say that it rained 30 times overall. Therefore, the probability that it rained is PROB(R=Rain) = 0.3 and the probability that the day was dry is PROB(R=Dry) = 0.7. Yet, we have other variables that are related to each other. If we consider another random variable C representing cloud cover and ranging over the values sunny and overcast. Let us say that it was overcast on 60 of the 100 days that were observed. Of these 60 days, it rained on 30 occasions. Therefore, intuitively, if you were given the fact that it was overcast, i.e., that C = Overcast, then the probability that R = Rain would be 0.5 (30 out of 60). This intuition is captured in the concept of conditional probability and is written as PROB(Rain|Overcast). This could also be described as the probability of the event Rain occurring given the event Overcast. Let us see if the above formula agrees with our intuition:

 

We know that PROB(Overcast) = 0.6 and PROB(Rain & Overcast) = 0.3, and using the definition of conditional probability you can compute PROB(Rain|Overcast) and see that it agrees with our intuition:

 

 

PROB(Rain|Overcast)  = PROB(Rain & Overcast)

      PROB(Overcast)

 

                                      =  0.3 / 0.6

 

                                        = 0.5

 

 

1.2 Bayes’ Rule

 

Another very important mathematical notion, which must be explained, is called Bayes’ Rule. Bayes’ Rule relates the conditional probability of an event A given B to the conditional probability of B given A:

 

PROB(A|B)=PROB(B|A) * PROB(A)

                               PROB(B)

 

This rule can be illustrated simply using the previous example. Using the rule we are able to compute the probability that it was overcast on a day that on which it rained (which will of course = 1, as in our example it is always overcast on a rainy day)

 

PROB(Overcast|Rain) = (PROB(Rain|Overcast) * PROB(Overcast))

                                                               PROB(Rain)

 

                                            = (0.5 * 0.6) / 0.3

 

                                      = 1

 

Bayes’ Rule is sometimes useful in estimating probabilities in the event of the absence of some of the information required.

 

 

1.3 Independence (statistical)

 

Another notion to be considered is that of independence (statistical). Two events A and B are said to be statistically independent if PROB(A|B) = PROB(A) – i.e., whether or not B has in fact occurred has no effect on the probability that A will occur. Using the definition of conditional probability, this can be reformulated as:

 

PROB(A & B) = PROB(A) * PROB(B)

 

Now, let us try and relate this concept to Natural Language Processing. At first let us consider a simple case that helps us to illustrate the ideas with regard to the probability theory (we will go into more detail in section 2 on ‘part-of-speech tagging’): First, let us say that we need to identify the words that can be either nouns or verbs. We can formalize this using two random variables, one, C, which ranges over the parts of speech involved (in this case N and V) and two, W, which ranges over all the possible words. Let W = rains. The problem is to determine whether PROB(C=N| W=rains) or PROB(C=V|W=rains) is greater. The random variables are generally omitted from the formulas, therefore PROB(C=N| W=rains) will be written as PROB(N|rains) for short.  Thus, using the definition of conditional probability we calculate the probabilities for the different categories as follows:

 

          PROB(N|rains) = PROB(rains & N)

                                        PROB(rains)

 

          PROB(V|rains) = PROB(rains & V)

                                        PROB(rains)

 

We can reduce the problem this problem to:

 

          PROB(N|rains) = PROB(rains & N)

                                       

          PROB(V|rains) = PROB(rains & V)

 

since the denominator is the same in both cases.

 

Let us say that in order to obtain the probabilities for these cases we have a tagged corpus containing 1,000,000 words. Now let’s say that the word rains occurs in the corpus 1000 times, 700 times in a verb sense and 300 times in a noun sense. Therefore, the probability of a randomly selected word being rains as a noun and rains as a verb are as follows:

 

          PROB(rains & N) "a 300 / 1,000,000

                                     = 0.0003

 

          PROB(rains & V) "a 700 / 1,000,000

                                     = 0.0007

 

Therefore the most likely outcome is that rains will appear as a verb.  We can now compute the probability that rains is a verb using the formula:

 

 

          PROB(V|rains) = PROB(V & rains)

                                       PROB(rains)

 

                                  = 0.0007 / 0.001

 

                                  = 0.7

 

 

2. Part-of-speech Tagging

 

2.1 Overview

 

Part-of-speech tagging (‘tagging’ for short) is the process of going through a corpus of sentences and labelling each word in each sentence with its ‘part-of-speech’ or lexical class marker. A tagged corpus is a corpus that has been so labelled. A tag is one of the labels. Large-scale corpora, such as the Penn Treebank might have up to 40 different tags contained in its tagset. The following is an example of some of the tags used in the Penn treebank (Jurafsky & Martin, 2000, pg.297):

 

Tag
Description
Example

CC

Coordinating conjunction

and, but, or

VB

Verb, base form

eat

JJ

Adjective

yellow

NNS

Noun, plural

cats

RBS

Adverb, superlative

fastest

TO

To

to

 

 

Part-of-speech tagging is an area involved in the front end of probabilistic parsing that, given a sentence, containing ambiguous words, helps us determine the most likely sequence of syntactic categories for the words in the sentence. Given, that the core task of POS tagging is to choose the correct tag for each word in context from a set of possible tags, then if the tagging (or disambiguation) method is efficient, then our parser is more likely to act efficiently, minimising error.

In section 1.3 on statistical independence the basic idea behind POS tagging was explained, yet this method (called the ‘standard unigram baseline method’) was very primitive and chose the interpretation that occurred most frequently in the corpus without taking any context/collocations into account. In the above example we obtained a poor success rate of 70%, which will not get us far in a real world implementation. Yet, the word examined was ambiguous and the result does not give a true reflection of the results that might be obtained with the use of a much larger data source such as a treebank. Over half the words in corpora are not ambiguous and surprisingly using this method, on average we often obtain a success rate of over 90% (Allen 1995, p.195). We still need to improve our percentages and in order to do so we must consider more contexts, e.g., the sentences in which rains occurs. Yet, when we design a new classification algorithm, we should always compare its efficiency  against the unigram baseline.

          The important information we need is concerned with word collocations, that is, what words tend to appear together. We saw that choosing the verb sense of rains in our corpus appeared to be the right choice and would be correct about 70% of the time. Yet, if rains was preceded by the determiner the, it would much more likely be a noun. The following techniques are capable of exploiting the local context of the sentence in which the word appears:

          (Allen, 1995, p.196) Let w1,…,wT be a sequence of words. We want to find the sequence of lexical categories C1,…,CT that maximizes

 

1. PROB(C1,…,CT|w1,…,wT)

 

Unfortunately, it would take far too much data to generate reasonable estimates for such sequences directly from a corpus, so we must apply more indirect methods. However, there are methods that allow us to approximate. In order for us to develop them, we must reformulate the problem using Bayes’ rule, which restates PROB(C1,…,CT|w1,…,wT)  as

 

2. PROB(C1,…,CT) * PROB(w1,…,wT|C1,…,CT)/PROB(w1,…,wT)

 

As the common denominator PROB(wi,…,wT) does not affect the answer(since we are interested in finding C1,…,CT that gives the maximum value), in effect we want a sequence C1,…,CT that maximizes the formula

 

3. PROB(C1,…,CT) * PROB(w1,…,wT|C1,…,CT)

 

These probabilities can be approximated by probabilities that are easier to collect by making some independence assumptions. Though these independence assumptions are not really valid, the estimates tend to work quite well in practice. In making these assumptions we use n-grams which are detailed below.

 

 

2.2 N-gram Models

 

An n-gram is an n-tuple of things, for example words or lexical categories. Suppose that we are concerned with n lexical categories L1, L2,…,Ln. The term n-gram is used in statistical NLP in connection with the conditional probability that a word will belong to Ln given that the preceding words were in L1,L2,…,L[n-1]. This probability is written PROB(Ln|L[n-1]…L2 L1).

          The first factor in formula 3 above, PROB(C1,…,CT) can be approximated by a sequence of probabilities based on a limited number of previous categories (n-gram models, n representing the number of words in the particular pattern). The bigram model uses PROB(Ci|Ci-1) which is the conditional probability that a category Ci will follow a category Ci-1. The trigram model uses PROB(Ci|Ci-2 Ci-1), which takes the two preceding categories into account.

Using bigrams the following approximation can be used:

 

PROB(C1,…,CT) P (i=1, T) PROB(Ci|Ci-1)

 

In order to account for the beginning of a sentence, we invent a pseudocategory, ø at position 0 as the value of C0. Then the probability of a sequence DET N V N using a bigram model would be:

 

PROB(DET N V N) P PROB(DET|ø) * PROB(N|DET) * PROB(V|N) * PROB(N|V)

 

                  

The second factor in formula 3

 

PROB(w1,…,wT|C1,…,CT)

 

can be approximated by assuming that a word appears in a category independent of the words in the preceding or succeeding categories. It is approximated by:

 

(i=1, T) PROB(wi|Ci)

 

 

With these approximations, the problem has changed into finding the sequence C1,…,CT that maximizes the following formula:

 

* (i=1, T) PROB(Ci|Ci-1) * PROB(wi|Ci)

 

These probabilities can now be estimated from a tagged corpus. For example, the probability that a verb follows a noun can be estimated as follows:

 

PROB(Ci = V|Ci-1 = N) P Count(N at position i-1 and V at i)

                                            Count(N at position i-1)

 

 

For the rest of this section we can resume the discussion of how to find the most likely sequence of categories for a sequence of words as we now have a formula* which is the motivation for an efficient tagger.

 

          Most tagging algorithms fall into one of two classes, either rule-based taggers or stochastic taggers. ‘Rule based taggers’ involve a large database of hand-written disambiguation rules, which specify, for example, that an ambiguous word is a noun rather than a verb if it follows a determiner. On the other hand ‘Stochastic taggers’ generally resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context. For the sake of this project we will discuss in detail the Stochastic tagger called an HMM tagger (Hidden Markov Model tagger).

 

2.3 Hidden Markov Model

 

A Hidden Markov Model is a generative device. It generates or accepts sequences of words. In an HMM there are a set of states (lexical categories in our case) with directed edges, transitions between states (labelled with transition probabilities), and symbols emitted by the states (while in a state, one outputs a single symbol before moving to the next state). There are two kinds of probabilities associated with an HMM, transition probabilities, i.e. the probability of a transition from one state to another, and output probabilities, i.e. the probability of a certain state emitting a certain symbol. The assumption is that a word depends probabilistically on just its own ‘part-of-speech’ (i.e. its tag), which in turn depends on the part-of-speech of the preceding word (or on the start state in case there is no preceding word).

          A weighted finite-state automata or Markov Model is specified by the set of states Q, the set of transition probabilities A, a defined start state and end states and a set of observation likelihoods B. For weighted automata we will define the probabilities bi(ot) as 1.0 if the state i matches the observation ot and 0 if they don’t match. This is because we can only observe the states of the Markov Model in training(the tags of the labelled corpus), but we only observe words in applying the Markov Model(HMM) to the tagging class. The way in which an HMM differs from a Markov Model is that we add two more requirements.

 

 

Fig.1 An example HMM

 

In summary, here are the parameters that define an HMM (Jurafsky & Martin, 2000):

Each aij represents the probability of transitioning from state i to state j. The set of these is the ‘transition probability matrix’.

 

In our example HMM we see that there is a special, non-emitting start state in use. We can avoid the use of these non-emitting states (which also include end states) with the specification of two more things.

·        Initial distribution: an initial probability distribution over states, À, such that Ài is the probability that the HMM will start in state i. Of course some states j may have Àj = 0, so these cannot be initial states.

·        Accepting states: a set of legal accepting states.

 

As is the case for weighted automata, the sequences of symbols that are input to the model or which are produced by the model are generally called the observation sequence, referred to as O=(o1,o2,o3…ot).

          Using the above model, the probability of generating “dogcatchers catch old red fish” can be calculated as follows: first work out the probability of the lexical sequence N-V-ADJ-ADJ-N, which is 1 * 1 * 0.5 * 0.1 * 0.9 = 0.045, and then multiply this by the product of the output probabilities of the words, i.e. by 0.3 * 0.2 * 0.6 * 0.2 * 0.5 = 0.0036, for a final probability of 0.000162. More generally the formula for computing the probability of a sentence w1,…,wT given a sequence C1,…,CT is " (i=1, T)PROB(Ci|Ci-1) * PROB(wi|Ci) that was explained above. The rest of this section will explain and motivate this equation.    The key insight to be taken from the above is that because of the Markov assumption (which states that the only information affecting the probability of an output, or the next state, is the prior state), you do not have to enumerate all the possible sequences. Sequences which happen to end in the same category are easily dealt with using a HMM. They can be collapsed together since the next category only depends on the previous one in the sequence. So, if we just keep track of the most probable sequence found so far for each possible ending category, we can then just ignore all the other less likely sequences. In order to illustrate this procedure in a more concise and clear manner I have taken an example transition diagram from (Allen, “Natural Language Understanding” 2nd ed. Chapter 7,  ‘Ambiguity Resolution’) which represents a HMM.

 

 

Fig.2 The representation of a HMM, encoding the 256 possible sequences of ‘Flies like a flower’.

 

In order to find the most likely sequence of tags, we sweep forward through the words, one by one, finding the most likely sequence for each ending category. In other words, we find the four best sequences for the words ‘flies like’: the best ending with like as a V, the best as an N, the best as a P and the best as an ART. We then use this information to find the four best sequences for the words ‘flies like a’, each one ending in a different category. We keep repeating this process until we have accounted for all the words contained in the sequence. This process is described in the algorithm usually called the ‘Viterbi Algorithm’.

 

2.4 Viterbi Algorithm

 

For a problem involving T words, and N lexical categories, the Viterbi algorithm is guaranteed to find the most likely sequence using k*T*N2 steps for some constant k. The reason we use this algorithm and the reason why it is so effective is that it is applicable in a range of situations that allows a space that apparently has an exponential number of points in it to be searched in polynomial time. It is the most effective method in finding the most likely sequence of tags for a sequence of words. In the above example in Fig.2 we see that the search takes (NT = 44) 256 steps to find the most likely sequence. Using the Viterbi algorithm the search would k*4*16 steps, which is sure to be a significantly faster method than the brute force method applied above.  Let us consider the algorithm in more detail (Manning & Schütze, 1999, p.332):

 

Commonly we want to find the most likely complete path, i.e.:

 

arg max P(X|O, μ)

      X

In order to do this, it is sufficient to maximize for a fixed O:

 

arg max P(X,O|μ)

      X

The Viterbi algorithm will compute this path. Define:

 

δj (t) = max P(X1…Xt-1,o1…ot-1,Xt = j|μ)

 

This variable stores for each point along the trellis the probability of the most probable path that leads to that node. The corresponding variable ψj(t) then records the node of the incoming arc that led to the most probable path through the whole trellis as follows:

 

1. Initialisation step

 

δj (1) = πj, 1 ≤ j ≤ N

 

2. Induction step

 

δj (t+1) = max δi(t)aijbijot, 1 ≤ j ≤ N

              1 ≤ i ≤ N

 

Store backtrace

 

ψj(t+1) = arg max δi(t)aijbijot, 1 ≤ j ≤ N

               1 ≤ i ≤ N

3. Termination and path readout (by backtracking). The most        likely state sequence is worked out from the right backwards:

 

XT+1 = arg max δi(T+1)

           1 ≤ i ≤ N

Xt = ψXt+1(t+1)

 

P(X) = max δi(T+1)

         1 ≤ i ≤ N

 

Algorithms like this can perform effectively if the probability estimates are computed from large-scale corpora that contain the same style data as the input to be classified. The error rate can be cut in half using this type of technique. It is now clear how part-of-speech taggers can be used as a front-end to parsers and we will come to understand their close relation in the next section.

 

 

3. Probabilistic Context Free Grammars

 

3.1 Overview

 

Probabilistic context free grammars (PCFG’s) are one of the many ways of building probabilistic models of syntactic structure. A probabilistic (sometimes called ‘stochastic context-free grammar’) is simply an augmentation of a context free grammar with probabilities assigned to the grammar rules, indicating how likely different rewritings are. They are the simplest and most natural probabilistic models for recursive embedding within tree structures. The probabilistic theory behind them is well understood and the algorithms for them are a progression from the algorithms employed with such models as HMM’s (Hidden Markov Models). (Allen, 1995)Finite state machines, such as HMM’s can be generalized to the probabilistic case. Context free grammars can also be generalized. In order to do this we must have some statistics on our grammar rule use. The simplest method is to count the number of times a rule appears in a corpus of parsed sentences and use this to estimate the probability of the rule being used. Fig.3 on page 15 shows a probabilistic CFG with the probabilities derived from analysing a corpus of parsed sentences. This PCFG provides the user with a computational device that is sufficient enough to be generalised so that it can simulate various forms of probabilistic conditioning, in this case, sentential disambiguation.

          (Manning & Schütze, 1999) A context-free grammar G is defined by four parameters,  (N, Σ, P, S):

 

·        A set of non-terminal symbols (or ‘variables’) N

·        A set of terminal symbols Σ (disjoint from N)

·        A set of productions P, each of the form A → β, where A is a non-terminal and β is a string of symbols from the infinite set of strings    (Σ U N)

·        A designated start symbol S

 

A probabilistic context-free grammar augments each rule in P with a conditional probability:

 

          A → β [p]

 

Therefore a PCFG is a 5-tuple G = (N, Σ, P, S, D) consisting of:

·        A set of non-terminal symbols (or ‘variables’) N

·        A set of terminal symbols Σ (disjoint from N)

·        A set of productions P, each of the form A → β, where A is a non-terminal and β is a string of symbols from the infinite set of strings    (Σ U N)

·        A designated start symbol S

·        A function D which assigns probabilities to each rule in P. This function D expresses the probability p that the given non-terminal A will be expanded to the sequence β. It is often written as:

 

P(A → β)

 

Note that when we write P(A → β) we always mean P(A → β|A). That is, we are giving the probability distribution of the daughters of a certain head. In formal terms, this is the conditional probability of a given expansion given the left-hand-side non-terminal A. Thus, if all the possible expansions of the non-terminal were considered the sum of their probabilities would equal to 1. Such a grammar can be used either to parse or generate sentences of the language. Intuitively,     P(A → β) means the probability of expanding the non-terminal A using this rule as opposed to any of the other rules for A. As with non-probabilistic versions of context-free grammars, we can look at a PCFG as defining a context-free language that specifies how to expand the starting symbol into the strings in the language or, conversely, how to assign a structure to a given string in the language. The main difference is that a probabilistic context-free grammar also assigns a probability to the string such that all the probabilities of all the strings sum to one. Fig.3 shows an example probabilistic CFG taken from (Charniak 1993, p.76).

 

 

   S  → NP VP          : 0.8

   S  → VP               : 0.2

  NP → NOUN          : 0.4

  NP → NOUN PP     : 0.4

  NP → NOUN NP     : 0.2

  VP → VERB           : 0.3

  VP → VERB NP      : 0.3

  VP → VERB PP      : 0.2

  VP → VERB NP PP : 0.2

  PP → PREP NP      : 1.0

 

 

 PREP → like        :1.0

 VERB → swat      :0.2

 VERB → flies       :0.4

 VERB → like        :0.4

 NOUN → swat      :0.05

 NOUN → flies       :0.45

 NOUN → ants       :0.5

       Fig.3 An example PCFG

 

The above PCFG contains grammar rules with assigned probabilities. The non-terminal symbols are S, NP, VP, PP, PREP, VERB and NOUN. The terminal symbols are the words written in italics. Before we go into any more detail on the function of a PCFG we must ask the question, where have these PCFG probabilities come from? And how can we obtain reliable statistics?

 

 

3.2 Gathering statistics for a PCFG

 

Before we investigate how to compute parse probabilities we must first ask the question, where do PCFG probabilities come from? There are two possible methods of assigning probabilities to a grammar. The simplest method is to use a corpus of already parsed sentences. Such a corpus is called a treebank. In section 2 we discussed how part-of-speech tagging methods help us to determine the most likely sequence of syntactic categories for the words in a sentence. Given, such a treebank with part-of-speech tags already assigned we can use statistical methods to identify the common structures in English and favour these in the parsing algorithm. The probability of each expansion of a non-terminal can be computed by counting the number of times that expansion occurs and then normalizing.

 

PROB(a -> b|a) = Count(a -> b)      =  Count(a -> b)

                          g Count(a -> g)      Count a

 

This part-of-speech tagging method allows us to choose the most likely interpretation when a sentence is ambiguous and will, in time lead to very efficient parsers. They allow for a considerable amount of ambiguity to be resolved before our parser has even begun to work. Yet, if the tagging is incorrect we are left with considerable problems. The parser may be prevented from ever finding the correct interpretation for a sentence or, even worse the parser may find a valid, but implausible interpretation based on a badly tagged word and never be able to realise its error. Even though, such algorithms as the Viterbi algorithm have a tagging success rate of up to 96%, they do have their limitations with the chances of a twelve-word sentence  being correctly tagged being 46%.

          When a treebank is unavailable then the counts needed for calculating PCFG probabilities can be generated by first parsing a corpus. In an ideal world where sentences were unambiguous, it would be as simple as this: parse a corpus, increment a counter for every rule in the parse, and then normalize these to obtain our probabilities. However, since this is not the case and most sentences are ambiguous, in reality we need to keep a separate count for each parse of a sentence and weight each partial count by the probability of the parse it appears in. The algorithm used for computing this method is called the Inside-Outside algorithm, and was proposed as a generalisation of the Forward-Backward algorithm. The Inside-Outside algorithm is a training algorithm which allows us to train the parameters of a PCFG on unannotated sentences of the language. The basic assumption behind them is that a good grammar is one that makes the sentences in the training corpus likely to occur, and hence we seek the grammar that maximizes the likelihood of the training data. Again the details regarding this area are outside the scope of this project and the idea behind the algorithm has been described merely as a way to illustrate the methods of assigning probabilities to PCFG’s.

 

 

3.3 Computing Parse Probabilities

 

Now that we know where these probabilities are derived, we must ask the question, how are these probabilities used to solve parsing problems? With this PCFG we can now develop algorithms, which given a sentence help us to estimate probabilities that will find the most likely parse tree that could have generated that sentence. This attribute of a PCFG is particularly useful in solving the problem of ambiguity. For example, let us consider the two parses of the sentence “Swat flies like ants” (one meaning a command to swat flies and the other, less intuitive meaning is taken in relation to a certain type of ‘swat fly’ that likes ‘ants’). Two of the possible tree structures are represented in Figures 4.1 and 4.2 below (Charniak 1995).

         Fig.4.1 (T1)                         Fig.4.2 (T2)

 

The PCFG assigns a probability to each parse tree T, which is a derivation of a sentence S. The probability of each parse tree T is defined as the product of the probabilities of all the rules r used to expand each node n in the parse tree:

 

P(T, S) = ∏ P(r(n))

             nЄT

The resulting probability P(T, S) is both the joint probability of the parse and the sentence, and also the probability of the parse P(T). How can this be true? First, by the definition of joint probability we have:

 

          P(T, S) = P(T) P(S|T)

 

But as a parse tree includes all of the words of the sentence, then P(S|T) is

 

          P(T, S) = P(T) P(S|T) = P(T)

 

The probability of each of the trees in Fig.4 can be calculated simply by finding the product of the probabilities of all the rules used in the derivation. For example, the probability of the tree in Fig.4.1 (T1) and the tree in Fig.4.2 (T2) can be calculated as follows, by taking the probabilities from the PCFG in Fig.3:

 

P(T1) = 0.8 * 0.2 * 0.3 * 0.4 * 0.05 * 0.45 * 0.4 * 0.4 * 0.5

         = 3.456 * 10-5

 

P(T2) = 0.2 * 0.3 * 0.2 * 0.4 * 0.45 * 1.0 * 1.0 * 0.4 * 0.5

         = 4.32 * 10-3

 

We see that T2 has a higher probability and, thus this parse would be the one chosen by a disambiguation algorithm, which selects the parse with the highest PCFG probability. The most probable parse was also intuitively the most likely and therefore it was successful in both the parsing and disambiguation of the sentence. We will now look at how the above probabilistic CFG can be implemented as a parser. (Re: http://www.ling.gu.se/~nivre/kurser/wwwstat/pdcg.html) It is a quite straightforward method, which involves encoding a context-free grammar using the built in DCG (definite clause grammar) available in most implementations of Prolog. All that is required to assign probabilities is that every category symbol (term) is extended with an extra argument for the probability of that constituent, and that every rule is extended with a Prolog call in order to multiply the probabilities of the daughters with the probability of the rule in order to obtain the probability of the entire constituent. Thus a PCFG rule of the form:

 

x0 à x1…xn : p

 

Where p is the rule probability, will be translated into the following DCG rule:

 

x0 (P0) à x1(P1),,xn(Pn), {P0 is p * P1 *…* Pn}.

 

When using the grammar for parsing, it is usually convenient to extend each category symbol (term) with yet another argument where the parse tree is built in the form of a term. Thus, we have:

 

x0 (P0,x0(X1,…,Xn)) àx1(P1, X1),,xn(Pn, Xn), {P0 is p * P1 *…* Pn}.

 

Parsing a string S (represented as a list of words) using this DCG style encoding, is done by following the Prolog call:

 

?- x0(P, T, S, []).

 

Where T is instantiated to a term representing a parse tree, while P is instantiated to the probability of T. We will encode the above PCFG from Fig.3 in a DCG (Prolog) format in Fig.5 as our mini implementation.

 

 

3.4 Mini-Implementation and Testing

 

 

%Fig.5 Mini Implementation

%Pcfg.pro

%An example probabilistic CF-PSG in prolog as a DCG %with extensions

 

s(P0,s(NP,VP)) --> np(P1,NP), vp(P2,VP), {P0 is 0.8*P1*P2}.

s(P0,s(VP)) --> vp(P1,VP), {P0 is 0.2*P1}.

 

np(P0,np(N)) --> n(P1,N), {P0 is 0.4*P1}.

np(P0,np(N,PP)) --> n(P1,N), pp(P2,PP), {P0 is 0.4*P1*P2}.

np(P0,np(N,NP)) --> n(P1,N), np(P2,NP), {P0 is 0.2*P1*P2}.

 

vp(P0,vp(V)) --> v(P1,V), {P0 is 0.3*P1}.

vp(P0,vp(V,NP)) --> v(P1,V), np(P2,NP), {P0 is 0.3*P1*P2}.

vp(P0,vp(V,PP)) --> v(P1,V), pp(P2,PP), {P0 is 0.2*P1*P2}.

vp(P0,vp(V,NP,PP)) --> v(P1,V), np(P2,NP), pp(P3,PP), {P0 is 0.2*P1*P2*P3}.

 

pp(P0,pp(P, NP)) --> p(P1, P), np(P2, NP), {P0 is 1.0*P1*P2}.

 

p(1.0,p(like)) -->[like].

v(0.2,v(swat)) -->[swat].

v(0.4,v(flies)) -->[flies].

v(0.4,v(like)) -->[like].

n(0.05,n(swat)) -->[swat].

n(0.45,n(flies)) -->[flies].

n(0.5,n(ants)) -->[ants].

 

 

Parsing the string “swat flies like ants” is done with the following Prolog query:

 

?- s(Prob, T,[swat, flies, like, ants],[]).

 

Prolog gives us back the following sentence parses and their associated probabilities:

 

1.     Prob = 0.000256

T = s(np(n(swat)), vp(v(flies), pp(p(like), np(n(ants))))) ;

 

2.     Prob = 3.456e-005

T = s(np(n(swat), np(n(flies))), vp(v(like), np(n(ants)))) ;

 

3.     Prob = 0.000432

T = s(vp(v(swat), np(n(flies), pp(p(like), np(n(ants)))))) ;

 

4.     Prob = 0.000288

T = s(vp(v(swat), np(n(flies)), pp(p(like), np(n(ants))))) ;

 

According to our parser the most likely parse for the sentence “swat flies like ants” is that represented in parse 3. This is the parse that was represented as a tree structure in Fig.4.2. Parse number 2 is that represented by the tree structure in Fig.4.1, which happens to be the least likely parse for the sentence. Of course our program is not just limited to parsing this sentence. Let us look at another example, “ants like flies”. The Prolog query in this case being:

 

?- s(Prob, T,[ants, like, flies],[]).

 

In this case there is only one possible parse for the sentence and Prolog gives us back:

 

Prob = 0.003456

T = s(np(n(ants)), vp(v(like), np(n(flies)))) ;

 

This tells us that, according to our PCFG this sentence is unambiguous as there is only one possible parse. Yet, intuitively there can be more than one parse, such as one that says ‘ants are like flies’. Yet, our grammar does not allow for the rule, pp à p, as this particular PCFG is limited to a small grammar. Though our PCFG will allow for this type of intuition if we declare this as  np à noun pp:

 

?- np(Prob,T,[ants,like,flies],[]).

 

Prob = 3.6 * 10-2

T = np(n(ants),pp(p(like),np(n(flies))));

 

I have also encoded another grammar in Fig. 6 below, by taking a DCG from my second year lecture notes and adding the required extensions to encode the probabilities. The terminal rule probabilities are not based on a corpus and were made up simply for expository purposes, whereas I obtained the lexical rule probabilities from an online statistical part-of-speech tagger (http://www.coli.uni-sb.de/~thorsten/tnt/), which gathers its information from the Penn Treebank. It indicates how simply we can apply simple part-of-speech tagging methods to a grammar in order to make it a working PCFG. This encoding of a PCFG also includes some relative clauses, number agreement and structure building.

 

 

%Fig. 6 Mini-Implementation

%pcfg3.pro

%An example probabilistic CF-PSG in Prolog as a DCG

%This version contains number agreement and structure %building. The lexical probabilities are taken from a %tagger used on the Penn Treebank

 

s(P0, s(NP,VP))--> np(Num,P1,NP), vp(Num,P2,VP),

                                                                               {P0 is 1.0*P1*P2}.

 


%Calculates rule probability

 

np(Num, P0, NP)--> det(Num,P1,Det), n(Num, P2, N), rel(Num,P3,R),

                 {P0 is P1*P2*P3},

                                              {build_np(Det,N,R,NP)}.

%Structure building

 

np(Num, P0, np(PN))--> pn(Num,P1,PN), {P0 is 0.35*P1}.

 

vp(Num,P0,vp(V,NP))--> tv(Num,P1,V), np(_,P2, NP),

                                                                                  {P0 is 0.4*P1*P2}.

vp(Num,P0,vp(V))-->iv(Num,P1,V), {P0 is 0.6*P1}.

 

rel(Num, P0, rel(C,VP))-->comp(P1,C), vp(Num,P2,VP),

                                                                                  {P0 is 1.0*P1*P2}.

  

   rel(Num,1.0,rel(empty))-->[].  %empty relative clause

 

det(_,1.0,det(the))--> [the].      %lexical production rules

det(sing,0.98,det(a))--> [a].     %determiners

det(sing,0.07,det(that))--> [that].

 

n(sing,0.66,n(boy))--> [boy].    %nouns

n(sing,1.0,n(girl))--> [girl].

n(sing,1.0,n(apple))-->[apple].

n(sing,0.31,n(pear))-->[pear].

 

n(pl,0.66,n(boys))--> [boys].

n(pl,1.0,n(girls))--> [girls].

n(pl,1.0,n(apples))-->[apples].

n(pl,0.31,n(pears))-->[pears].

 

pn(sing,0.74,pn(john))-->[john].    %proper nouns

pn(sing,0.35,pn(mary))-->[mary].

 

tv(_,1.0,v(ate))-->[ate].                     %transitive verbs

tv(sing,1.0,v(eats))-->[eats].

tv(pl,0.68,v(eat))-->[eat].

tv(_,0.91,v(saw))-->[saw].

tv(sing,1.0,v(sees))-->[sees].

tv(pl,0.6,v(see))-->[see].

 

iv(_, 1.0, v(slept))-->[slept].             %intransitive verbs

iv(sing, 1.0, v(sleeps))-->[sleeps].

iv(pl, 0.44, v(sleep))-->[sleep].

iv(_, 1.0, v(wept))-->[wept].

iv(sing, 0.81, v(weeps))-->[weeps].

iv(pl, 1.0, v(weep))-->[weep].

 

comp(0.85,comp(that)) --> [that].

 

%These allow us to manipulate the structure of our %sentence. When a relative has content it is passed over.

 


build_np(Det,N,rel(empty),np(Det,N)).       

build_np(Det,N,rel(C,VP),np(Det,N,rel(C,VP))).

 

 

Now, let us parse some sentences using this PCFG. First of all we will see how the parser deals with number agreement by inputting the sentence “the boy sleeps”. We query this parser in the same method as the one developed in Fig.5:

 

?- s(Prob,Tree,[the,boy,sleeps],[]).

 

Prob = 0.396

Tree = s(np(det(the), n(boy)), vp(v(sleeps))) ;

 

We can see from this parse that the parser correctly handles the sentence but will it catch an ungrammatical sentence like “the boy sleep”:

 

?- s(Prob,Tree,[the,boy,sleep],[]).

 

No

?-

 

We see that it correctly handles number agreement. Let us now see how it manages relative clauses. We will input a sentence with a relative clause and a transitive verb, “the girls that slept eat the pears”:

 

?- s(Prob,Tree,[the,girls, that,slept,eat,the,pears],[]).

 

Prob = 0.0430032

Tree = s(np(det(the), n(girl), rel(comp(that), vp(v(slept)))), vp(v(eat), np(det(the), n(pears))))

 

We can see that this grammar will not accept any transitive verbs without their corresponding NP, e.g., “the girl ate”:

 

?- s(Prob,Tree,[the,girl,ate],[]).

 

No

?- 

 

We can also see that due to the structure building we will not get unnecessary positing of empty relative clauses in sentences such as “the boys eat the pears”:

 

?- s(Prob,Tree,[the,boys,eat,the,pears],[]).

 

Prob = 0.0556512

Tree = s(np(det(the), n(boy)), vp(v(eat), np(det(the), n(pear))))

 

Without this structure building we would have a rel(empty) posited at the end of the first and second NP phrases.

 

3.5 Further Development

 

Given the above PCFG, it is possible to develop algorithms that are similar in function to the Viterbi algorithm that was outlined previously in the section on part-of-speech tagging. We must assume that the probability of a constituent being derived by a rule Rj is independent of how the constituent is used as a subconstituent. For example, this assumption would imply to us that the probabilities of NP rules are the same whether the NP is the subject, the object of a verb, or the object of a preposition, whereas, for example, subject NPs are much more likely to be pronouns than other NPs placed in the same position.

          Taking this assumption into account we can now develop a formalism based on the probability that a constituent C generates a sequence of words wi, wi+1,…, wj, written as wi,j. This probability is called the inside probability as it is allocating a probability to the word sequence inside the constituent. We write this as follows:

 

PROB(wi,j|C)

 

Using this formalism the probabilities of specific parse trees for a sentence can be found using a standard chart-parsing algorithm, where the probability of each constituent is computed from the probability of its subconstituents and the probability of the rule used. More specifically we can formalise this by saying, when we enter an item E of category C on the chart, using rule i with n subconstituents corresponding to chart entries E1,…,En:

 

PROB (E) = PROB(Rule I|C) * PROB(E1) *…* PROB(En)

 

Ultimately, a parser constructed using these methods does not work as well as we might have hoped, though it does help. Generally, these techniques are found to find the correct parse in about 50% of situations. One critical issue is the handling of lexical items and that the model assumes that the probability of a particular verb being used in a VP rule is independent of which rule is being considered. This means lexical preferences for certain rules cannot be handled within the models basic framework.

          While pure probabilistic CFGs clearly have their faults, the techniques they develop are paramount in making generalizations on developing more efficient parsing methods such as context-dependent best-first parsers, etc. Because of the problems associated with them, most current probabilistic parsing models use some augmentation of PCFGs rather than using the pure version. We will now summarise some of the key features involved with probabilistic context-free grammars (both reasons for usage and also their limitations):

 

 

3.6 Important Features of PCFGs

 

(Manning & Schütze, 1999)

·        PCFGs are good at Grammar Induction/Grammar Learning. In order to explain this more clearly we must first make an important distinction, between positive training examples and negative training examples. A ‘positive training example’ is contained within the language and is defined by the grammar, e.g., a sentence (or string). These examples are provided by a corpus. ‘Negative training examples’ are those, which are not defined by the language, e.g., “provided are examples these”, the first four words of the previous sentence in reverse order. It is not possible to learn a context-free language unless one has both ‘positive’ and ‘negative’ training examples. One of the advantages that PCFGs have over CFGs is in grammar learning. CFGs cannot be learned (in the sense that one can identify a grammar if one is allowed to see as much data produced by the grammar as one wants) without the use of ‘negative’ evidence. Whereas, PCFGs can be trained from ‘positive’ evidence alone.

 

·        One of PCFG’s most obvious advantages is that of syntactic ambiguity. As grammars expand, they tend to increase in ambiguity and one gets structurally different parses for each word sequence. PCFG’s give an ordering of the parses by assigning each one a probability, thus giving some idea of the plausibility of different parses. However, by itself a PCFG’s ability to order parses is quite poor, since its probability estimates are structurally based and do not take lexical co-occurrence into consideration.

 

·        PCFGs give a probabilistic language model for English, (where a CFG does not). This is the case as any string of English is simply a sequence of words and the probability of the whole would simply be the product of the probabilities of the individual words. However, it turns out not to produce a very good model as we see from the following example:

A PCFG is said to be consistent if the sum of the probabilities of all the sentences in the language is equal to 1. However, certain types of recursive rules cause a grammar to be inconsistent, by causing infinitely looping derivations for some sentences. For example, a rule S à S with a probability of 1 would lead to lost probability mass into infinite trees that do not generate strings of the language due to derivations that never terminate. However, it is not essential to have proper probability distributions, especially if we are only comparing the size of different probability estimates. Yet, in practice, a PCFG is a worse language model than an n-gram model (for n > 1). This is so as n-gram models take some local lexical context into consideration.

 

·        We see from above that, on their own as pure PCFGs, they are not good language models. However, we can hope to combine the strengths of n-gram models and PCFGs in order to facilitate a more efficient language model. 

 

·        A PCFG is a robust device for parsing. Most real text tends to contain errors and deficiencies. However, a PCFG is capable of dealing with these deficiencies as it rules out nothing in the grammar. Instead it just gives implausible sentences a low probability rating. Lets look at an example from the encoded PCFG. If we feed an ‘off the wall’ sentence to the parser, which is nonetheless grammatical, we obtain the following results:

 

?- s(Prob, T,[swat, flies, ants, like, swat, swat, ants],[]).

 

Prob = 7.68e-009

T = s(np(n(swat)), vp(v(flies), np(n(ants), pp(p(like), np(n(swat), np(n(swat), np(n(ants))))))))

 

We see that, while the sentence is still technically grammatical our parser gives it a very low probability rating.

 

·        Unfortunately, PCFGs are biased. Generally speaking, with a PCFG the probability of a smaller tree is greater than a larger tree. Of course, this assumption is not ridiculous by any means, but it does not give a sensible model of actual sentences. We see that our PCFG gives too much of the probability mass to the very short sentences:

 

?-np(Prob,T,[swat,swat,swat,flies],[]).

 

Prob = 1.8000000000000013e-007

T = np(n(swat),np(n(swat),np(n(swat),np(n(flies)))));

 

?- np(Prob,T,[swat,flies],[]).

 

Prob = 1.8000000000000006e-003

T = np(n(swat),np(n(flies))),

 

 

Non-terminals with a small number of expansions will be favoured by PCFGs over non-terminals with many expansions, since the individual rewritings will have a much higher probability. Such a distribution of probabilities may be seen as improper, though in practice, improper distributions are not that much of a problem. Often, it doesn’t really matter if these distributions are improper, especially in cases where we are only comparing the magnitude of different probability estimates. Moreover, providing we gather our statistics from parsed training corpora, we should always obtain a proper probability distribution.

 

 

4. Conclusion

 

Now that the processes behind tagging and parsing using a PCFG are well documented I feel that I have a sound basis for a more complex and efficient implementation in 4th year. At present, I would be interested in implementing a Context-Dependent Best-First Parser, which parses sentences in a treebank. These ‘Best-First’ parsers work on algorithms which explore high-probability constituents first. The hope is that with this type of implementation we will have a parser that finds the best parse both, first and quickly. It is clear that PCFGs have done little for real parsing efficiency, yet the theory behind them and how they are tagged is easily progressed upon when regarding a Context-Dependent Best-First Parser, e.g., context-dependent lexical probabilities can be computed using an algorithm similar to the Viterbi algorithm, which computes the forward probabilities. These statistically based methods show great promise in addressing the ambiguity resolution problem in natural language processing and such a parser that uses the ‘Best-First’ strategy can dramatically improve on things. Unfortunately, for this project the context-free framework showed us its limitations in identifying the intuitively correct parse tree. However, I feel that the implementation of a context-dependent parser in 4th year would be impossible with this foundation for probabilistic parsing. Context-dependent methods are capable of significantly improving our results and I hope to be able to manipulate their strengths with an efficient implementation next year.

 

 

 

5. Bibliography

 

5.1 Books

 

James Allen, “Natural language understanding” 2nd edition – Redwood City, Calif.: Benjamin/Cummins Pub. Co., 1995. (Chapter 7: Ambiguity Resolution).

 

Jurafsky & Martin “Speech and Language processing: an introduction to natural language processing”, Upper Saddle River, N.J.: Prentice Hall, 2000.

 

Eugene Charniak, “Statistical language learning”, Cambridge, Mass.: MIT Press, 1993.

 

Manning, Christopher D. & Schütze, H., “Foundations of statistical natural language processing”, Cambridge, Mass.: MIT Press, 1999.

 

 

5.2 Online References
Source
Last Visited
http://www.cse.unsw.edu.au/~billw/nlpdict.html
20.8.01
http://www.coli.uni-sb.de/~thorsten/tnt/
26.8.01
http://www.uni-giessen.de/~g91062/Seminare/gkcl/Allen95/al199507.htm#chap7_5
29.8.01
http://www.cogs.susx.ac.uk/lab/nlp/gazdar/teach/nlp/nlpnode1.html
19.8.01
http://www.cogs.susx.ac.uk/lab/nlp/gazdar/teach/nlp/nlpnode63.html

http://www.cogs.susx.ac.uk/lab/nlp/gazdar/teach/nlp/nlpnode64.html

http://www.cogs.susx.ac.uk/lab/nlp/gazdar/teach/nlp/nlpnode65.html

http://www.cogs.susx.ac.uk/lab/nlp/gazdar/teach/nlp/nlpnode66.html

http://www.cogs.susx.ac.uk/lab/nlp/gazdar/teach/nlp/nlpnode67.html

http://www.cogs.susx.ac.uk/lab/nlp/gazdar/teach/nlp/nlpnode74.html

19.8.01
http://www.ling.gu.se/~nivre/kurser/wwwstat/probability.html

http://www.ling.gu.se/~nivre/kurser/wwwstat/langmod.html

http://www.ling.gu.se/~nivre/kurser/wwwstat/tag_pars.html

http://www.ling.gu.se/~nivre/kurser/wwwstat/pdcg.html

15.8.01
http://www.ling.gu.se/~lager/Spaghetti/spaghetti.html
22.8.01
 

 

Back home

Back to top of page