The purpose of this project is to initially make an overview
of statistical approaches to NLP. This objective is achieved by firstly
outlining the probability theorem used in statistical NLP and then applying
this theorem to ‘Part-of-Speech Tagging’ (POS tagging) and probabilistic
parsers. I have chosen to then concentrate on the area of Probabilistic
Context-Free Grammars and to investigate at a deep level how these PCFGs
can be used as parsers and to examine how efficiently they work in the area of
sentential disambiguation by implementing a PCFG as a parser in Prolog. By
acquainting myself strongly with these topics and providing myself with
concrete information on these areas, I believe that this research will provide
a strong basis for a 4th year implementation, regarding Context-Dependent
Best-First parsers. The following contents should help provide a better
overview of the structure of my project:
Contents
1. Relevant
probability Theory:
·
1.1 An Introduction to Probability
Theory
·
1.4 Statistical Independence
·
2.3 Hidden Markov Models (HMMs)
3. Probabilistic Context-Free Grammars
·
3.2 Gathering Statistics for a PCFG
·
3.3 Computing Parse Probabilities
·
3.4 Mini Implementation and Testing
·
3.6 Important Features of PCFGs
1.1
An Introduction to Probability Theory
Let
us first explain some of the terms used with regard to statistical NLP:
=> Ω = {Head, Tail}.
Intuitively,
the probability of some event is the likelihood that the event will occur. A
probability of 1 indicates that the event is certain to occur and a probability
of 0 indicates the event will definitely not occur. Therefore a probability of
0.5 would indicate that an event is equally likely to occur as not to occur.
Probability can be defined more formally in terms of a random variable.
A random variable may range over a predefined set or values or over infinite
sets and continuous values. Here we will use random variables which range over
a finite set of values. Consider the throwing of a die. The random variable
THROW representing the result of throwing a die ranges over the six possible
values {1,2,3,4,5,6}. No other value for THROW is available. Therefore we will
talk of the probability of THROW = 1 simply as the probability of the event 1.
A probability function PROB,
assigns a probability to every possible value of a random variable. Every
probability function must have the following properties, where e1,…,en are the possible distinct values of a
random variable E:
1.
PROB(ei)≥0, for all i.
2.
PROB(ei)≤1, for all i.
3.
∑i=1,
n PROB(ei)=1.
A
very important mathematical notion involved in probabilistic parsing and
ambiguity resolution, is that of Conditional Probability. It is defined
by the formula:
PROB(e|e′)
= PROB(e & e′) / PROB(e′)
Where
PROB(e & e′) is the probability of the two events
occurring simultaneously.
The
following example explains the formulas’ usage more clearly:
Given
weather readings for 100 days of the year, where, on a particular day it either
rained or not, a weather reading is represented as a random variable R that has
one of two values, Rain or Dry. Say that it rained 30 times overall. Therefore,
the probability that it rained is PROB(R=Rain) = 0.3 and the probability that
the day was dry is PROB(R=Dry) = 0.7. Yet, we have other variables that are
related to each other. If we consider another random variable C representing
cloud cover and ranging over the values sunny and overcast. Let us say that it
was overcast on 60 of the 100 days that were observed. Of these 60 days, it
rained on 30 occasions. Therefore, intuitively, if you were given the fact that
it was overcast, i.e., that C = Overcast, then the probability that R = Rain
would be 0.5 (30 out of 60). This intuition is captured in the concept of
conditional probability and is written as PROB(Rain|Overcast). This could also
be described as the probability of the event Rain occurring given the event
Overcast. Let us see if the above formula agrees with our intuition:
We
know that PROB(Overcast) = 0.6 and PROB(Rain & Overcast) = 0.3, and using
the definition of conditional probability you can compute PROB(Rain|Overcast)
and see that it agrees with our intuition:
PROB(Rain|Overcast) = PROB(Rain & Overcast)
PROB(Overcast)
=
0.3 / 0.6
= 0.5
Another
very important mathematical notion, which must be explained, is called Bayes’
Rule. Bayes’ Rule relates the conditional probability of an event A given B
to the conditional probability of B given A:
PROB(A|B)=PROB(B|A)
* PROB(A)
PROB(B)
This
rule can be illustrated simply using the previous example. Using the rule we
are able to compute the probability that it was overcast on a day that on which
it rained (which will of course = 1, as in our example it is always overcast on
a rainy day)
PROB(Overcast|Rain)
= (PROB(Rain|Overcast) * PROB(Overcast))
PROB(Rain)
=
(0.5 * 0.6) / 0.3
= 1
Bayes’ Rule is sometimes useful in estimating probabilities in the event of the absence of some of the information required.
1.3 Independence (statistical)
Another
notion to be considered is that of independence (statistical). Two
events A and B are said to be statistically independent if PROB(A|B) = PROB(A)
– i.e., whether or not B has in fact occurred has no effect on the probability
that A will occur. Using the definition of conditional probability, this can be
reformulated as:
PROB(A
& B) = PROB(A) * PROB(B)
Now, let us try and relate this concept to Natural Language Processing. At first let us consider a simple case that helps us to illustrate the ideas with regard to the probability theory (we will go into more detail in section 2 on ‘part-of-speech tagging’): First, let us say that we need to identify the words that can be either nouns or verbs. We can formalize this using two random variables, one, C, which ranges over the parts of speech involved (in this case N and V) and two, W, which ranges over all the possible words. Let W = rains. The problem is to determine whether PROB(C=N| W=rains) or PROB(C=V|W=rains) is greater. The random variables are generally omitted from the formulas, therefore PROB(C=N| W=rains) will be written as PROB(N|rains) for short. Thus, using the definition of conditional probability we calculate the probabilities for the different categories as follows:
PROB(N|rains) = PROB(rains &
N)
PROB(rains)
PROB(V|rains)
= PROB(rains & V)
PROB(rains)
We
can reduce the problem this problem to:
PROB(N|rains) = PROB(rains & N)
PROB(V|rains)
= PROB(rains & V)
since
the denominator is the same in both cases.
Let
us say that in order to obtain the probabilities for these cases we have a
tagged corpus containing 1,000,000 words. Now let’s say that the word rains occurs
in the corpus 1000 times, 700 times in a verb sense and 300 times in a noun
sense. Therefore, the probability of a randomly selected word being rains as
a noun and rains as a verb are as follows:
PROB(rains & N) "a 300 /
1,000,000
= 0.0003
PROB(rains & V) "a 700 /
1,000,000
= 0.0007
Therefore
the most likely outcome is that rains will appear as a verb. We can now compute the probability that rains
is a verb using the formula:
PROB(V|rains) = PROB(V &
rains)
PROB(rains)
=
0.0007 / 0.001
= 0.7
Part-of-speech
tagging (‘tagging’ for short) is the process of going through a corpus
of sentences and labelling each word in each sentence with its ‘part-of-speech’
or lexical class marker. A tagged corpus is a corpus that has been so labelled.
A tag is one of the labels. Large-scale corpora, such as the Penn Treebank
might have up to 40 different tags contained in its tagset. The following is an
example of some of the tags used in the Penn treebank (Jurafsky & Martin, 2000, pg.297):
Tag |
Description |
Example |
|
CC |
Coordinating
conjunction |
and,
but, or |
|
VB |
Verb,
base form |
eat |
|
JJ |
Adjective |
yellow |
NNS
|
Noun, plural |
cats |
RBS |
Adverb,
superlative |
fastest |
TO |
To |
to |
Part-of-speech
tagging is an area involved in the front end of probabilistic parsing that,
given a sentence, containing ambiguous words, helps us determine the most
likely sequence of syntactic categories for the words in the sentence. Given,
that the core task of POS tagging is to choose the correct tag for each word in
context from a set of possible tags, then if the tagging (or disambiguation)
method is efficient, then our parser is more likely to act efficiently,
minimising error.
In section 1.3 on statistical independence
the basic idea behind POS tagging was explained, yet this method (called
the ‘standard unigram baseline method’) was very primitive and chose the
interpretation that occurred most frequently in the corpus without taking any
context/collocations into account. In the above example we obtained a poor
success rate of 70%, which will not get us far in a real world implementation.
Yet, the word examined was ambiguous and the result does not give a true
reflection of the results that might be obtained with the use of a much larger
data source such as a treebank. Over half the words in corpora are not
ambiguous and surprisingly using this method, on average we often obtain a
success rate of over 90% (Allen 1995, p.195). We still need to improve our percentages and in order to
do so we must consider more contexts, e.g., the sentences in which rains occurs.
Yet, when we design a new classification algorithm, we should always compare
its efficiency against the unigram
baseline.
The important information we need is
concerned with word collocations, that is, what words tend to appear together.
We saw that choosing the verb sense of rains in our corpus appeared to
be the right choice and would be correct about 70% of the time. Yet, if rains
was preceded by the determiner the, it would much more likely be a noun.
The following techniques are capable of exploiting the local context of the
sentence in which the word appears:
(Allen, 1995, p.196) Let w1,…,wT
be a sequence of words. We want to find the sequence of lexical categories C1,…,CT
that maximizes
1.
PROB(C1,…,CT|w1,…,wT)
Unfortunately,
it would take far too much data to generate reasonable estimates for such
sequences directly from a corpus, so we must apply more indirect methods. However,
there are methods that allow us to approximate. In order for us to develop
them, we must reformulate the problem using Bayes’ rule, which restates PROB(C1,…,CT|w1,…,wT) as
2.
PROB(C1,…,CT) * PROB(w1,…,wT|C1,…,CT)/PROB(w1,…,wT)
As
the common denominator PROB(wi,…,wT) does not affect the
answer(since we are interested in finding C1,…,CT that
gives the maximum value), in effect we want a sequence C1,…,CT that
maximizes the formula
3.
PROB(C1,…,CT) * PROB(w1,…,wT|C1,…,CT)
These
probabilities can be approximated by probabilities that are easier to collect
by making some independence assumptions. Though these independence assumptions
are not really valid, the estimates tend to work quite well in practice. In
making these assumptions we use n-grams which are detailed below.
An
n-gram is an n-tuple of things, for example words or lexical categories.
Suppose that we are concerned with n lexical categories L1, L2,…,Ln. The
term n-gram is used in statistical NLP in connection with the conditional
probability that a word will belong to Ln given that the preceding words were
in L1,L2,…,L[n-1]. This probability is written PROB(Ln|L[n-1]…L2 L1).
The first factor in formula 3 above,
PROB(C1,…,CT) can be approximated by a sequence of
probabilities based on a limited number of previous categories (n-gram models,
n representing the number of words in the particular pattern). The bigram model
uses PROB(Ci|Ci-1) which is the conditional probability
that a category Ci will follow a category Ci-1. The trigram
model uses PROB(Ci|Ci-2 Ci-1), which takes the
two preceding categories into account.
Using
bigrams the following approximation can be used:
PROB(C1,…,CT)
P ∏ (i=1, T) PROB(Ci|Ci-1)
In
order to account for the beginning of a sentence, we invent a pseudocategory, ø
at position 0 as the value of C0. Then the probability of a sequence
DET N V N using a bigram model would be:
PROB(DET N V N) P PROB(DET|ø) * PROB(N|DET) * PROB(V|N) * PROB(N|V)
The
second factor in formula 3
PROB(w1,…,wT|C1,…,CT)
can
be approximated by assuming that a word appears in a category independent of
the words in the preceding or succeeding categories. It is approximated by:
∏ (i=1,
T) PROB(wi|Ci)
With
these approximations, the problem has changed into finding the sequence C1,…,CT
that maximizes the following formula:
*∏ (i=1, T) PROB(Ci|Ci-1) * PROB(wi|Ci)
These probabilities can now be estimated from a tagged corpus. For example, the probability that a verb follows a noun can be estimated as follows:
PROB(Ci
= V|Ci-1 = N) P Count(N at position i-1 and V at i)
Count(N at position i-1)
For
the rest of this section we can resume the discussion of how to find the most
likely sequence of categories for a sequence of words as we now have a formula*
which is the motivation for an efficient tagger.
Most tagging algorithms fall into one
of two classes, either rule-based taggers or stochastic taggers.
‘Rule based taggers’ involve a large database of hand-written disambiguation
rules, which specify, for example, that an ambiguous word is a noun rather than
a verb if it follows a determiner. On the other hand ‘Stochastic taggers’
generally resolve tagging ambiguities by using a training corpus to compute the
probability of a given word having a given tag in a given context. For the sake
of this project we will discuss in detail the Stochastic tagger called
an HMM tagger (Hidden Markov Model tagger).
A
Hidden Markov Model is a generative device. It generates or accepts sequences
of words. In an HMM there are a set of states (lexical categories in our case)
with directed edges, transitions between states (labelled with transition
probabilities), and symbols emitted by the states (while in a state, one
outputs a single symbol before moving to the next state). There are two kinds
of probabilities associated with an HMM, transition probabilities, i.e. the
probability of a transition from one state to another, and output
probabilities, i.e. the probability of a certain state emitting a certain
symbol. The assumption is that a word depends probabilistically on just its own
‘part-of-speech’ (i.e. its tag), which in turn depends on the part-of-speech of
the preceding word (or on the start state in case there is no preceding word).
A weighted finite-state automata or
Markov Model is specified by the set of states Q, the set of transition
probabilities A, a defined start state and end states and a set of observation
likelihoods B. For weighted automata we will define the probabilities bi(ot)
as 1.0 if the state i matches the observation ot and 0 if they don’t
match. This is because we can only observe the states of the Markov
Model in training(the tags of the labelled corpus), but we only observe words
in applying the Markov Model(HMM) to the tagging class. The way in which an HMM
differs from a Markov Model is that we add two more requirements.
Fig.1 An example HMM
In summary, here are the parameters that define an HMM (Jurafsky & Martin, 2000):
Each aij represents the
probability of transitioning from state i to state j. The set of these is the ‘transition
probability matrix’.
In our example HMM we see that there is a special, non-emitting start state in use. We can avoid the use of these non-emitting states (which also include end states) with the specification of two more things.
· Initial distribution: an initial probability distribution over states, À, such that Ài is the probability that the HMM will start in state i. Of course some states j may have Àj = 0, so these cannot be initial states.
· Accepting states: a set of legal accepting states.
As is the case for weighted automata, the sequences of symbols that are input to the model or which are produced by the model are generally called the observation sequence, referred to as O=(o1,o2,o3…ot).
Using the above model, the probability of generating “dogcatchers catch old red fish” can be calculated as follows: first work out the probability of the lexical sequence N-V-ADJ-ADJ-N, which is 1 * 1 * 0.5 * 0.1 * 0.9 = 0.045, and then multiply this by the product of the output probabilities of the words, i.e. by 0.3 * 0.2 * 0.6 * 0.2 * 0.5 = 0.0036, for a final probability of 0.000162. More generally the formula for computing the probability of a sentence w1,…,wT given a sequence C1,…,CT is " (i=1, T)PROB(Ci|Ci-1) * PROB(wi|Ci) that was explained above. The rest of this section will explain and motivate this equation. The key insight to be taken from the above is that because of the Markov assumption (which states that the only information affecting the probability of an output, or the next state, is the prior state), you do not have to enumerate all the possible sequences. Sequences which happen to end in the same category are easily dealt with using a HMM. They can be collapsed together since the next category only depends on the previous one in the sequence. So, if we just keep track of the most probable sequence found so far for each possible ending category, we can then just ignore all the other less likely sequences. In order to illustrate this procedure in a more concise and clear manner I have taken an example transition diagram from (Allen, “Natural Language Understanding” 2nd ed. Chapter 7, ‘Ambiguity Resolution’) which represents a HMM.

Fig.2 The representation of a HMM, encoding the 256
possible sequences of ‘Flies like a flower’.
In order to find the most likely sequence of tags, we sweep forward through the words, one by one, finding the most likely sequence for each ending category. In other words, we find the four best sequences for the words ‘flies like’: the best ending with like as a V, the best as an N, the best as a P and the best as an ART. We then use this information to find the four best sequences for the words ‘flies like a’, each one ending in a different category. We keep repeating this process until we have accounted for all the words contained in the sequence. This process is described in the algorithm usually called the ‘Viterbi Algorithm’.
For a problem involving T words, and N lexical categories,
the Viterbi algorithm is guaranteed to find the most likely sequence using
k*T*N2 steps for some constant k. The reason we use this algorithm
and the reason why it is so effective is that it is applicable in a range of
situations that allows a space that apparently has an exponential number of
points in it to be searched in polynomial time. It is the most effective method
in finding the most likely sequence of tags for a sequence of words. In the above
example in Fig.2 we see that the search takes (NT = 44)
256 steps to find the most likely sequence. Using the Viterbi algorithm the
search would k*4*16 steps, which is sure to be a significantly faster method
than the brute force method applied above. Let us consider the algorithm in more detail (Manning & Schütze, 1999, p.332):
Commonly we want
to find the most likely complete path, i.e.:
arg max P(X|O,
μ)
X
In order to do
this, it is sufficient to maximize for a fixed O:
arg max
P(X,O|μ)
X
The Viterbi
algorithm will compute this path. Define:
δj (t)
= max P(X1…Xt-1,o1…ot-1,Xt
= j|μ)
This variable
stores for each point along the trellis the probability of the most probable
path that leads to that node. The corresponding variable ψj(t)
then records the node of the incoming arc that led to the most probable path
through the whole trellis as follows:
1.
Initialisation step
δj (1)
= πj, 1 ≤ j ≤ N
2. Induction
step
δj (t+1) = max δi(t)aijbijot,
1 ≤ j ≤ N
1 ≤ i ≤ N
Store backtrace
ψj(t+1) = arg max
δi(t)aijbijot,
1 ≤ j ≤ N
1 ≤ i ≤ N
3. Termination
and path readout (by backtracking). The most likely state sequence is worked out from the right
backwards:
XT+1 = arg max δi(T+1)
1 ≤ i ≤ N
Xt = ψXt+1(t+1)
P(X) = max δi(T+1)
1 ≤ i ≤ N
Algorithms like this can perform effectively if the probability estimates are computed from large-scale corpora that contain the same style data as the input to be classified. The error rate can be cut in half using this type of technique. It is now clear how part-of-speech taggers can be used as a front-end to parsers and we will come to understand their close relation in the next section.
Probabilistic context free grammars (PCFG’s)
are one of the many ways of building probabilistic models of syntactic
structure. A probabilistic (sometimes called ‘stochastic context-free
grammar’) is simply an augmentation of a context free grammar with probabilities
assigned to the grammar rules, indicating how likely different rewritings are.
They are the simplest and most natural probabilistic models for recursive
embedding within tree structures. The probabilistic theory behind them is well
understood and the algorithms for them are a progression from the algorithms
employed with such models as HMM’s (Hidden Markov Models). (Allen, 1995)Finite
state machines, such as HMM’s can be generalized to the probabilistic case.
Context free grammars can also be generalized. In order to do this we must have
some statistics on our grammar rule use. The simplest method is to count the
number of times a rule appears in a corpus of parsed sentences and use this to
estimate the probability of the rule being used. Fig.3 on page 15 shows a
probabilistic CFG with the probabilities derived from analysing a corpus of
parsed sentences. This PCFG provides the user with a computational device that
is sufficient enough to be generalised so that it can simulate various forms of
probabilistic conditioning, in this case, sentential disambiguation.
(Manning
& Schütze, 1999) A context-free grammar G is defined by four
parameters, (N, Σ, P, S):
·
A set of
non-terminal symbols (or ‘variables’) N
·
A set of
terminal symbols Σ (disjoint from N)
·
A set of
productions P, each of the form A → β, where A is a
non-terminal and β is a string of symbols from the infinite set of
strings (Σ U N)
·
A designated
start symbol S
A probabilistic context-free grammar
augments each rule in P with a conditional probability:
A
→ β [p]
Therefore a PCFG is a 5-tuple G = (N,
Σ, P, S, D) consisting of:
·
A set of
non-terminal symbols (or ‘variables’) N
·
A set of
terminal symbols Σ (disjoint from N)
·
A set of
productions P, each of the form A → β, where A is a
non-terminal and β is a string of symbols from the infinite set of
strings (Σ U N)
·
A designated
start symbol S
·
A function D
which assigns probabilities to each rule in P. This function D
expresses the probability p that the given non-terminal A will be
expanded to the sequence β. It is often written as:
P(A → β)
Note that when we write P(A → β)
we always mean P(A → β|A). That is, we are giving the probability
distribution of the daughters of a certain head. In formal terms, this is the
conditional probability of a given expansion given the left-hand-side
non-terminal A. Thus, if all the possible expansions of the non-terminal were
considered the sum of their probabilities would equal to 1. Such a grammar can
be used either to parse or generate sentences of the language.
Intuitively, P(A → β)
means the probability of expanding the non-terminal A using this rule as
opposed to any of the other rules for A. As with non-probabilistic versions of
context-free grammars, we can look at a PCFG as defining a context-free
language that specifies how to expand the starting symbol into the strings
in the language or, conversely, how to assign a structure to a given string in
the language. The main difference is that a probabilistic context-free
grammar also assigns a probability to the string such that all the
probabilities of all the strings sum to one. Fig.3 shows an example
probabilistic CFG taken from (Charniak 1993, p.76).
|
S → NP VP : 0.8
S → VP : 0.2
NP → NOUN : 0.4 NP → NOUN PP : 0.4 NP → NOUN NP : 0.2 VP → VERB : 0.3 VP → VERB NP : 0.3 VP → VERB PP : 0.2 VP → VERB NP PP : 0.2 PP → PREP
NP : 1.0 |
PREP → like
:1.0 VERB → swat
:0.2 VERB → flies :0.4 VERB → like
:0.4 NOUN → swat
:0.05 NOUN → flies
:0.45 NOUN → ants
:0.5 |
Fig.3 An example PCFG
The above PCFG contains grammar rules with
assigned probabilities. The non-terminal symbols are S, NP, VP, PP, PREP, VERB
and NOUN. The terminal symbols are the words written in italics. Before we go
into any more detail on the function of a PCFG we must ask the question, where
have these PCFG probabilities come from? And how can we obtain reliable
statistics?
3.2 Gathering statistics for a PCFG
Before we
investigate how to compute parse probabilities we must first ask the question,
where do PCFG probabilities come from? There are two possible methods of
assigning probabilities to a grammar. The simplest method is to use a corpus of
already parsed sentences. Such a corpus is called a treebank. In section 2 we
discussed how part-of-speech tagging methods help us to determine the most
likely sequence of syntactic categories for the words in a sentence. Given,
such a treebank with part-of-speech tags already assigned we can use
statistical methods to identify the common structures in English and favour
these in the parsing algorithm. The probability of each expansion of a
non-terminal can be computed by counting the number of times that expansion
occurs and then normalizing.
PROB(a -> b|a) = Count(a -> b)
= Count(a -> b)
∑g Count(a -> g) Count a
This part-of-speech
tagging method allows us to choose the most likely interpretation when a
sentence is ambiguous and will, in time lead to very efficient parsers. They
allow for a considerable amount of ambiguity to be resolved before our parser
has even begun to work. Yet, if the tagging is incorrect we are left with
considerable problems. The parser may be prevented from ever finding the
correct interpretation for a sentence or, even worse the parser may find a
valid, but implausible interpretation based on a badly tagged word and never be
able to realise its error. Even though, such algorithms as the Viterbi
algorithm have a tagging success rate of up to 96%, they do have their
limitations with the chances of a twelve-word sentence being correctly tagged being 46%.
When a treebank is unavailable then
the counts needed for calculating PCFG probabilities can be generated by first
parsing a corpus. In an ideal world where sentences were unambiguous, it would
be as simple as this: parse a corpus, increment a counter for every rule in the
parse, and then normalize these to obtain our probabilities. However, since
this is not the case and most sentences are ambiguous, in reality we need to
keep a separate count for each parse of a sentence and weight each partial count
by the probability of the parse it appears in. The algorithm used for computing
this method is called the Inside-Outside algorithm, and was proposed as
a generalisation of the Forward-Backward algorithm. The Inside-Outside
algorithm is a training algorithm which allows us to train the parameters of a
PCFG on unannotated sentences of the language. The basic assumption behind them
is that a good grammar is one that makes the sentences in the training corpus
likely to occur, and hence we seek the grammar that maximizes the likelihood of
the training data. Again the details regarding this area are outside the scope
of this project and the idea behind the algorithm has been described merely as
a way to illustrate the methods of assigning probabilities to PCFG’s.
3.3 Computing Parse Probabilities
Now that we know where these probabilities are derived, we must ask the
question, how are these probabilities used to solve parsing problems? With this
PCFG we can now develop algorithms, which given a sentence help us to estimate
probabilities that will find the most likely parse tree that could have
generated that sentence. This attribute of a PCFG is particularly useful in
solving the problem of ambiguity. For example, let us consider the two
parses of the sentence “Swat flies like ants” (one meaning a command to swat
flies and the other, less intuitive meaning is taken in relation to a certain
type of ‘swat fly’ that likes ‘ants’). Two of the possible tree structures are
represented in Figures 4.1 and 4.2 below (Charniak 1995).

Fig.4.1 (T1) Fig.4.2 (T2)
The PCFG assigns
a probability to each parse tree T, which is a derivation of a sentence S. The
probability of each parse tree T is defined as the product of the probabilities
of all the rules r used to expand each node n in the parse tree:
P(T, S) = ∏ P(r(n))
nЄT
The resulting
probability P(T, S) is both the joint probability of the parse and the
sentence, and also the probability of the parse P(T). How can this be true?
First, by the definition of joint probability we have:
P(T, S) = P(T) P(S|T)
But as a parse tree includes all of the words of the
sentence, then P(S|T) is
P(T, S) = P(T)
P(S|T) = P(T)
The probability
of each of the trees in Fig.4 can be calculated simply by finding the product
of the probabilities of all the rules used in the derivation. For example, the
probability of the tree in Fig.4.1 (T1) and the tree in Fig.4.2 (T2) can be
calculated as follows, by taking the probabilities from the PCFG in Fig.3:
P(T1) = 0.8 * 0.2 * 0.3 * 0.4 * 0.05 * 0.45 * 0.4 * 0.4 *
0.5
= 3.456 *
10-5
P(T2) = 0.2 * 0.3 * 0.2 * 0.4 * 0.45 * 1.0 * 1.0 * 0.4 *
0.5
= 4.32 * 10-3
We see that T2
has a higher probability and, thus this parse would be the one chosen by a
disambiguation algorithm, which selects the parse with the highest PCFG
probability. The most probable parse was also intuitively the most likely and
therefore it was successful in both the parsing and disambiguation of the
sentence. We will now look at how the above probabilistic CFG can be
implemented as a parser. (Re:
http://www.ling.gu.se/~nivre/kurser/wwwstat/pdcg.html) It is a quite
straightforward method, which involves encoding a context-free grammar using
the built in DCG (definite clause grammar) available in most implementations of
Prolog. All that is required to assign probabilities is that every category
symbol (term) is extended with an extra argument for the probability of that
constituent, and that every rule is extended with a Prolog call in order to
multiply the probabilities of the daughters with the probability of the rule in
order to obtain the probability of the entire constituent. Thus a PCFG rule of
the form:
x0 à x1…xn
: p
Where p is the
rule probability, will be translated into the following DCG rule:
x0 (P0) à x1(P1),…,xn(Pn),
{P0 is p * P1 *…* Pn}.
When using the
grammar for parsing, it is usually convenient to extend each category symbol
(term) with yet another argument where the parse tree is built in the form of a
term. Thus, we have:
x0 (P0,x0(X1,…,Xn))
àx1(P1,
X1),…,xn(Pn, Xn), {P0 is p * P1 *…* Pn}.
Parsing a string S (represented as a list of words)
using this DCG style encoding, is done by following the Prolog call:
?- x0(P, T, S, []).
Where T is instantiated to a term representing a parse
tree, while P is instantiated to the probability of T. We will encode the above
PCFG from Fig.3 in a DCG (Prolog) format in Fig.5 as our mini implementation.
3.4 Mini-Implementation and Testing
%Pcfg.pro
%An example probabilistic
CF-PSG in prolog as a DCG %with extensions
s(P0,s(NP,VP)) -->
np(P1,NP), vp(P2,VP), {P0 is 0.8*P1*P2}.
s(P0,s(VP)) -->
vp(P1,VP), {P0 is 0.2*P1}.
np(P0,np(N)) -->
n(P1,N), {P0 is 0.4*P1}.
np(P0,np(N,PP)) -->
n(P1,N), pp(P2,PP), {P0 is 0.4*P1*P2}.
np(P0,np(N,NP)) -->
n(P1,N), np(P2,NP), {P0 is 0.2*P1*P2}.
vp(P0,vp(V)) -->
v(P1,V), {P0 is 0.3*P1}.
vp(P0,vp(V,NP)) -->
v(P1,V), np(P2,NP), {P0 is 0.3*P1*P2}.
vp(P0,vp(V,PP)) -->
v(P1,V), pp(P2,PP), {P0 is 0.2*P1*P2}.
vp(P0,vp(V,NP,PP)) -->
v(P1,V), np(P2,NP), pp(P3,PP), {P0 is 0.2*P1*P2*P3}.
pp(P0,pp(P, NP)) -->
p(P1, P), np(P2, NP), {P0 is 1.0*P1*P2}.
p(1.0,p(like))
-->[like].
v(0.2,v(swat))
-->[swat].
v(0.4,v(flies))
-->[flies].
v(0.4,v(like))
-->[like].
n(0.05,n(swat))
-->[swat].
n(0.45,n(flies))
-->[flies].
n(0.5,n(ants))
-->[ants].
Parsing the
string “swat flies like ants” is done with the following Prolog query:
?- s(Prob, T,[swat, flies, like, ants],[]).
Prolog gives us
back the following sentence parses and their associated probabilities:
1. Prob =
0.000256
T = s(np(n(swat)), vp(v(flies), pp(p(like),
np(n(ants))))) ;
2. Prob =
3.456e-005
T = s(np(n(swat), np(n(flies))), vp(v(like),
np(n(ants)))) ;
3. Prob =
0.000432
T = s(vp(v(swat), np(n(flies), pp(p(like),
np(n(ants)))))) ;
4. Prob =
0.000288
T = s(vp(v(swat), np(n(flies)), pp(p(like),
np(n(ants))))) ;
According to our
parser the most likely parse for the sentence “swat flies like ants” is that
represented in parse 3. This is the parse that was represented as a tree
structure in Fig.4.2. Parse number 2 is that represented by the tree structure
in Fig.4.1, which happens to be the least likely parse for the sentence. Of
course our program is not just limited to parsing this sentence. Let us look at
another example, “ants like flies”. The Prolog query in this case being:
?- s(Prob, T,[ants, like, flies],[]).
In this case
there is only one possible parse for the sentence and Prolog gives us back:
Prob = 0.003456
T = s(np(n(ants)), vp(v(like), np(n(flies)))) ;
This tells us
that, according to our PCFG this sentence is unambiguous as there is only one
possible parse. Yet, intuitively there can be more than one parse, such as one
that says ‘ants are like flies’. Yet, our grammar does not allow for the rule,
pp à p, as this particular PCFG is limited to a small grammar. Though our
PCFG will allow for this type of intuition if we declare this as np à noun pp:
?- np(Prob,T,[ants,like,flies],[]).
Prob = 3.6 * 10-2
T = np(n(ants),pp(p(like),np(n(flies))));
I have also
encoded another grammar in Fig. 6 below, by taking a DCG from my second year
lecture notes and adding the required extensions to encode the probabilities.
The terminal rule probabilities are not based on a corpus and were made up
simply for expository purposes, whereas I obtained the lexical rule
probabilities from an online statistical part-of-speech tagger
(http://www.coli.uni-sb.de/~thorsten/tnt/), which gathers its information from
the Penn Treebank. It indicates how simply we can apply simple part-of-speech
tagging methods to a grammar in order to make it a working PCFG. This encoding of
a PCFG also includes some relative clauses, number agreement and structure
building.
|
%Fig.
6 Mini-Implementation %pcfg3.pro %An
example probabilistic CF-PSG in Prolog as a DCG %This
version contains number agreement and structure %building. The lexical
probabilities are taken from a %tagger used on the Penn Treebank s(P0, s(NP,VP))--> np(Num,P1,NP), vp(Num,P2,VP),
{P0 is 1.0*P1*P2}.
%Calculates
rule probability np(Num, P0, NP)--> det(Num,P1,Det), n(Num, P2, N), rel(Num,P3,R), {P0 is
P1*P2*P3}, {build_np(Det,N,R,NP)}.
np(Num, P0, np(PN))--> pn(Num,P1,PN), {P0 is 0.35*P1}. vp(Num,P0,vp(V,NP))--> tv(Num,P1,V), np(_,P2, NP), {P0
is 0.4*P1*P2}. vp(Num,P0,vp(V))-->iv(Num,P1,V), {P0 is 0.6*P1}. rel(Num, P0, rel(C,VP))-->comp(P1,C), vp(Num,P2,VP),
{P0 is
1.0*P1*P2}. rel(Num,1.0,rel(empty))-->[]. %empty
relative clause det(_,1.0,det(the))--> [the].
%lexical production rules det(sing,0.98,det(a))--> [a].
%determiners det(sing,0.07,det(that))--> [that]. n(sing,0.66,n(boy))--> [boy].
%nouns n(sing,1.0,n(girl))--> [girl]. n(sing,1.0,n(apple))-->[apple]. n(sing,0.31,n(pear))-->[pear]. n(pl,0.66,n(boys))--> [boys]. n(pl,1.0,n(girls))--> [girls]. n(pl,1.0,n(apples))-->[apples]. n(pl,0.31,n(pears))-->[pears]. pn(sing,0.74,pn(john))-->[john].
%proper nouns pn(sing,0.35,pn(mary))-->[mary]. tv(_,1.0,v(ate))-->[ate]. %transitive
verbs tv(sing,1.0,v(eats))-->[eats]. tv(pl,0.68,v(eat))-->[eat]. tv(_,0.91,v(saw))-->[saw]. tv(sing,1.0,v(sees))-->[sees]. tv(pl,0.6,v(see))-->[see]. iv(_, 1.0, v(slept))-->[slept]. %intransitive verbs iv(sing, 1.0, v(sleeps))-->[sleeps]. iv(pl, 0.44, v(sleep))-->[sleep]. iv(_, 1.0, v(wept))-->[wept]. iv(sing, 0.81, v(weeps))-->[weeps]. iv(pl, 1.0, v(weep))-->[weep]. comp(0.85,comp(that)) --> [that]. %These
allow us to manipulate the structure of our %sentence. When a relative has
content it is passed over.
build_np(Det,N,rel(empty),np(Det,N)). build_np(Det,N,rel(C,VP),np(Det,N,rel(C,VP))). |
Now, let us parse
some sentences using this PCFG. First of all we will see how the parser deals
with number agreement by inputting the sentence “the boy sleeps”. We query this
parser in the same method as the one developed in Fig.5:
?- s(Prob,Tree,[the,boy,sleeps],[]).
Prob = 0.396
Tree = s(np(det(the), n(boy)), vp(v(sleeps))) ;
We can see from
this parse that the parser correctly handles the sentence but will it catch an
ungrammatical sentence like “the boy sleep”:
?- s(Prob,Tree,[the,boy,sleep],[]).
No
?-
We see that it
correctly handles number agreement. Let us now see how it manages relative
clauses. We will input a sentence with a relative clause and a transitive verb,
“the girls that slept eat the pears”:
?- s(Prob,Tree,[the,girls, that,slept,eat,the,pears],[]).
Prob = 0.0430032
Tree = s(np(det(the), n(girl), rel(comp(that),
vp(v(slept)))), vp(v(eat), np(det(the), n(pears))))
We can see that
this grammar will not accept any transitive verbs without their corresponding
NP, e.g., “the girl ate”:
?- s(Prob,Tree,[the,girl,ate],[]).
No
?-
We can also see
that due to the structure building we will not get unnecessary positing of
empty relative clauses in sentences such as “the boys eat the pears”:
?- s(Prob,Tree,[the,boys,eat,the,pears],[]).
Prob = 0.0556512
Tree = s(np(det(the), n(boy)), vp(v(eat),
np(det(the), n(pear))))
Without this
structure building we would have a rel(empty) posited at the end of the first
and second NP phrases.
Given the above PCFG, it is possible to develop algorithms that are
similar in function to the Viterbi algorithm that was outlined previously in
the section on part-of-speech tagging. We must assume that the probability of a
constituent being derived by a rule Rj is independent of how the
constituent is used as a subconstituent. For example, this assumption would
imply to us that the probabilities of NP rules are the same whether the NP is
the subject, the object of a verb, or the object of a preposition, whereas, for
example, subject NPs are much more likely to be pronouns than other NPs placed
in the same position.
Taking this assumption
into account we can now develop a formalism based on the probability that a
constituent C generates a sequence of words wi, wi+1,…, wj,
written as wi,j. This probability is called the inside
probability as it is allocating a probability to the word sequence inside
the constituent. We write this as follows:
PROB(wi,j|C)
Using this formalism the probabilities of specific parse trees for a
sentence can be found using a standard chart-parsing algorithm, where the
probability of each constituent is computed from the probability of its
subconstituents and the probability of the rule used. More specifically we can
formalise this by saying, when we enter an item E of category C on the chart,
using rule i with n subconstituents corresponding to chart entries E1,…,En:
PROB (E) = PROB(Rule I|C) * PROB(E1) *…* PROB(En)
Ultimately, a parser constructed using these methods does not work as
well as we might have hoped, though it does help. Generally, these techniques
are found to find the correct parse in about 50% of situations. One critical
issue is the handling of lexical items and that the model assumes that the
probability of a particular verb being used in a VP rule is independent of
which rule is being considered. This means lexical preferences for certain
rules cannot be handled within the models basic framework.
While pure probabilistic
CFGs clearly have their faults, the techniques they develop are paramount in
making generalizations on developing more efficient parsing methods such as
context-dependent best-first parsers, etc. Because of the problems associated
with them, most current probabilistic parsing models use some augmentation of
PCFGs rather than using the pure version. We will now summarise some of the key
features involved with probabilistic context-free grammars (both reasons for
usage and also their limitations):
3.6 Important Features of PCFGs
(Manning & Schütze, 1999)
·
PCFGs are good
at Grammar Induction/Grammar Learning. In order to explain this more
clearly we must first make an important distinction, between positive
training examples and negative training examples. A ‘positive
training example’ is contained within the language and is defined by the
grammar, e.g., a sentence (or string). These examples are provided by a corpus.
‘Negative training examples’ are those, which are not defined by the language,
e.g., “provided are examples these”, the first four words of the previous
sentence in reverse order. It is not possible to learn a context-free language
unless one has both ‘positive’ and ‘negative’ training examples. One of the
advantages that PCFGs have over CFGs is in grammar learning. CFGs cannot be
learned (in the sense that one can identify a grammar if one is allowed to see
as much data produced by the grammar as one wants) without the use of
‘negative’ evidence. Whereas, PCFGs can be trained from ‘positive’ evidence
alone.
·
One of PCFG’s
most obvious advantages is that of syntactic ambiguity. As grammars expand,
they tend to increase in ambiguity and one gets structurally different parses
for each word sequence. PCFG’s give an ordering of the parses by assigning each
one a probability, thus giving some idea of the plausibility of different
parses. However, by itself a PCFG’s ability to order parses is quite poor,
since its probability estimates are structurally based and do not take lexical
co-occurrence into consideration.
·
PCFGs give a
probabilistic language model for English, (where a CFG does not). This is the
case as any string of English is simply a sequence of words and the probability
of the whole would simply be the product of the probabilities of the individual
words. However, it turns out not to produce a very good model as we see from
the following example:
A PCFG is said to be consistent if the sum of the probabilities of all
the sentences in the language is equal to 1. However, certain types of
recursive rules cause a grammar to be inconsistent, by causing infinitely
looping derivations for some sentences. For example, a rule S à S with a probability of 1 would lead to lost probability mass into
infinite trees that do not generate strings of the language due to derivations
that never terminate. However, it is not essential to have proper probability
distributions, especially if we are only comparing the size of different
probability estimates. Yet, in practice, a PCFG is a worse language model than
an n-gram model (for n > 1). This is so as n-gram models take some local
lexical context into consideration.
·
We see from
above that, on their own as pure PCFGs, they are not good language models.
However, we can hope to combine the strengths of n-gram models and PCFGs in
order to facilitate a more efficient language model.
·
A PCFG is a
robust device for parsing. Most real text tends to contain errors and
deficiencies. However, a PCFG is capable of dealing with these deficiencies as
it rules out nothing in the grammar. Instead it just gives implausible
sentences a low probability rating. Lets look at an example from the encoded
PCFG. If we feed an ‘off the wall’ sentence to the parser, which is nonetheless
grammatical, we obtain the following results:
?- s(Prob, T,[swat, flies, ants, like, swat, swat,
ants],[]).
Prob = 7.68e-009
T = s(np(n(swat)), vp(v(flies), np(n(ants),
pp(p(like), np(n(swat), np(n(swat), np(n(ants))))))))
We see that, while the sentence is still technically grammatical our
parser gives it a very low probability rating.
·
Unfortunately,
PCFGs are biased. Generally speaking, with a PCFG the probability of a smaller
tree is greater than a larger tree. Of course, this assumption is not ridiculous
by any means, but it does not give a sensible model of actual sentences. We see
that our PCFG gives too much of the probability mass to the very short
sentences:
?-np(Prob,T,[swat,swat,swat,flies],[]).
Prob =
1.8000000000000013e-007
T =
np(n(swat),np(n(swat),np(n(swat),np(n(flies)))));
?-
np(Prob,T,[swat,flies],[]).
Prob =
1.8000000000000006e-003
T =
np(n(swat),np(n(flies))),
Non-terminals with a small number of expansions will be favoured by
PCFGs over non-terminals with many expansions, since the individual rewritings
will have a much higher probability. Such a distribution of probabilities may
be seen as improper, though in practice, improper distributions are not that
much of a problem. Often, it doesn’t really matter if these distributions are
improper, especially in cases where we are only comparing the magnitude of
different probability estimates. Moreover, providing we gather our statistics
from parsed training corpora, we should always obtain a proper probability
distribution.
Now that the
processes behind tagging and parsing using a PCFG are well documented I feel
that I have a sound basis for a more complex and efficient implementation in 4th
year. At present, I would be interested in implementing a Context-Dependent
Best-First Parser, which parses sentences in a treebank. These ‘Best-First’
parsers work on algorithms which explore high-probability constituents first.
The hope is that with this type of implementation we will have a parser that
finds the best parse both, first and quickly. It is clear that PCFGs have done
little for real parsing efficiency, yet the theory behind them and how they are
tagged is easily progressed upon when regarding a Context-Dependent Best-First
Parser, e.g., context-dependent lexical probabilities can be computed using an
algorithm similar to the Viterbi algorithm, which computes the forward
probabilities. These statistically based methods show great promise in
addressing the ambiguity resolution problem in natural language processing and
such a parser that uses the ‘Best-First’ strategy can dramatically improve on
things. Unfortunately, for this project the context-free framework showed us
its limitations in identifying the intuitively correct parse tree. However, I
feel that the implementation of a context-dependent parser in 4th
year would be impossible with this foundation for probabilistic parsing.
Context-dependent methods are capable of significantly improving our results
and I hope to be able to manipulate their strengths with an efficient
implementation next year.
James Allen, “Natural language
understanding” 2nd edition – Redwood City, Calif.: Benjamin/Cummins Pub. Co.,
1995. (Chapter 7: Ambiguity Resolution).
Jurafsky & Martin “Speech and Language
processing: an introduction to natural language processing”, Upper Saddle
River, N.J.: Prentice Hall, 2000.
Eugene Charniak, “Statistical
language learning”, Cambridge, Mass.: MIT Press, 1993.
Manning, Christopher D. & Schütze, H., “Foundations
of statistical natural language processing”, Cambridge, Mass.: MIT Press, 1999.