
Syntactic Processing

Syntactic processing is widely used in applications such as question answering systems, information extraction, sentiment analysis, and grammar checking. There are three broad levels of syntactic processing: POS tagging, constituency parsing, and dependency parsing.

POS tagging is a crucial task in syntactic processing and is used as a preprocessing step in many NLP applications.


POS tagging

The four main techniques used for POS tagging are lexicon-based tagging, rule-based tagging, probabilistic (stochastic) tagging, and deep-learning-based tagging.

Hidden Markov Models

Markov processes are commonly used to model sequential data, such as text and speech. The first-order Markov assumption states that the probability of an event (or state) depends only on the previous state. The Hidden Markov Model, an extension of the Markov process, is used to model phenomena where the states are hidden and emit observations. The transition and emission probabilities specify the probabilities of transition between states and of emission of observations from states, respectively. In POS tagging, the states are the POS tags, while the words are the observations. To summarise, a Hidden Markov Model is defined by the initial state, transition, and emission probabilities.

The POS tag $T_i$ for a given word $W_i$ depends on two things: the POS tag of the previous word and the word itself.

$P(T_i \mid W_i) = P(W_i \mid T_i) \cdot P(T_i \mid T_{i-1})$

So, the probability of a tag sequence $(T_1, T_2, T_3, \ldots, T_n)$ for a given word sequence $(W_1, W_2, W_3, \ldots, W_n)$ can be defined as:

$P(T \mid W) = \big(P(W_1 \mid T_1) \cdot P(T_1 \mid start)\big) \cdot \big(P(W_2 \mid T_2) \cdot P(T_2 \mid T_1)\big) \cdots \big(P(W_n \mid T_n) \cdot P(T_n \mid T_{n-1})\big)$

For a sequence of $n$ words and $t$ tags, a total of $t^n$ tag sequences are possible.
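As a sketch, the probability of one candidate tag sequence under a toy HMM can be computed by multiplying the emission and transition terms word by word. All probabilities below are made-up illustrative numbers, not estimates from a real corpus:

```python
# Toy emission probabilities P(word | tag) and transition probabilities
# P(tag | previous tag). "start" is the initial state.
emission = {("fish", "N"): 0.4, ("fish", "V"): 0.2,
            ("sleep", "N"): 0.1, ("sleep", "V"): 0.5}
transition = {("start", "N"): 0.6, ("start", "V"): 0.4,
              ("N", "N"): 0.3, ("N", "V"): 0.7,
              ("V", "N"): 0.5, ("V", "V"): 0.5}

def sequence_probability(words, tags):
    """Product of emission * transition terms for one tag sequence."""
    prob = 1.0
    prev = "start"
    for w, t in zip(words, tags):
        prob *= emission.get((w, t), 0.0) * transition.get((prev, t), 0.0)
        prev = t
    return prob

# P(fish|N)*P(N|start) * P(sleep|V)*P(V|N) = 0.4*0.6 * 0.5*0.7 = 0.084
sequence_probability(["fish", "sleep"], ["N", "V"])
```

Enumerating this product for every one of the $t^n$ possible tag sequences quickly becomes infeasible, which motivates the Viterbi approach below.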

Viterbi Heuristic

The Viterbi heuristic deals with this problem by taking a greedy approach. The basic idea is as follows: given a list of observations (words) $O_1, O_2, \ldots, O_n$ to be tagged, rather than computing the probabilities of all possible tag sequences, you assign tags sequentially, i.e. you assign the most likely tag to each word using the previous tag.

More formally, you assign the tag $T_i$ to each word $W_i$ such that it maximises the likelihood:

$P(T_i \mid W_i) = P(W_i \mid T_i) \cdot P(T_i \mid T_{i-1})$

where $T_{i-1}$ is the tag assigned to the previous word. The probability of a tag $T_i$ is assumed to depend only on the previous tag $T_{i-1}$, hence the term $P(T_i \mid T_{i-1})$; this is the Markov assumption.

The Viterbi algorithm is an example of a dynamic programming algorithm. In general, algorithms that break a complex problem down into subproblems and solve each subproblem optimally are called dynamic programming algorithms.
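The greedy per-word assignment described above can be sketched as follows. The tag set and all probabilities are illustrative assumptions:

```python
# Greedy tagging: for each word, pick the tag maximising
# P(word | tag) * P(tag | previous tag).
TAGS = ["N", "V"]
emission = {("fish", "N"): 0.4, ("fish", "V"): 0.2,
            ("sleep", "N"): 0.1, ("sleep", "V"): 0.5}
transition = {("start", "N"): 0.6, ("start", "V"): 0.4,
              ("N", "N"): 0.3, ("N", "V"): 0.7,
              ("V", "N"): 0.5, ("V", "V"): 0.5}

def greedy_tag(words):
    tags, prev = [], "start"
    for w in words:
        best = max(TAGS, key=lambda t: emission.get((w, t), 0.0)
                                       * transition.get((prev, t), 0.0))
        tags.append(best)
        prev = best
    return tags

greedy_tag(["fish", "sleep"])  # ['N', 'V'] with these toy numbers
```

The full Viterbi algorithm improves on this by keeping, for every position and tag, the best score over all paths ending in that tag, rather than committing to a single tag at each step.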

Learning HMM Model Parameters

The process of learning the probabilities from a tagged corpus is called training an HMM model. The emission and transition probabilities can be learnt from corpus counts: the emission probability $P(w \mid t)$ is the number of times the word $w$ is tagged $t$ divided by the total count of the tag $t$, and the transition probability $P(t_2 \mid t_1)$ is the number of times $t_1$ is followed by $t_2$ divided by the total count of $t_1$.
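These relative-frequency estimates can be sketched as follows, using a tiny hand-made tagged corpus for illustration:

```python
from collections import Counter

# A tiny tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VB")],
          [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")]]

tag_counts, emit_counts, trans_counts = Counter(), Counter(), Counter()
for sentence in corpus:
    prev = "start"
    tag_counts["start"] += 1
    for word, tag in sentence:
        tag_counts[tag] += 1
        emit_counts[(word, tag)] += 1
        trans_counts[(prev, tag)] += 1
        prev = tag

def emission_p(word, tag):
    # P(word | tag) = count(word, tag) / count(tag)
    return emit_counts[(word, tag)] / tag_counts[tag]

def transition_p(prev, tag):
    # P(tag | prev) = count(prev -> tag) / count(prev)
    return trans_counts[(prev, tag)] / tag_counts[prev]

emission_p("the", "DT")    # 2/2 = 1.0
transition_p("DT", "NN")   # 2/2 = 1.0
```

In practice these counts come from a large tagged corpus, and smoothing is applied so that unseen word-tag pairs do not get zero probability.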


Constituency Parsing

For many tasks, shallow parsing is not sufficient. Shallow parsing, as the name suggests, refers to fairly shallow levels of parsing, such as POS tagging and chunking. Such techniques cannot check the grammatical structure of a sentence, i.e. whether a sentence is grammatically correct, or understand the dependencies between the words in a sentence.

The two most commonly used paradigms of parsing are constituency parsing and dependency parsing, both of which help to check the grammatical structure of a sentence.

In constituency parsing, you learnt the basic idea of constituents as grammatically meaningful groups of words, or phrases, such as the noun phrase and the verb phrase. You also learnt the idea of context-free grammars, or CFGs, which specify a set of production rules. Any production rule can be written as A -> B C, where A is a non-terminal symbol (NP, VP, N, etc.) and B and C are either non-terminals or terminal symbols (i.e. words in the vocabulary, such as flight or man).

An example of a CFG is:

S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'man' | 'flight'
V -> 'booked'

There are two broad approaches to constituency parsing: top-down parsing and bottom-up parsing.

Top-down parsers have a specific limitation: left recursion.

An example of a left-recursive rule is VP -> VP NP. Whenever a top-down parser encounters such a rule, it runs into an infinite loop, and no parse tree is obtained. The following is an illustration of a top-down parse:

Figure 2: Top-down parse

Figure 3: Bottom-up parse

Probabilistic CFG

Since natural languages are inherently ambiguous, there are often cases where multiple parse trees are possible. In such cases, we need a way to make the algorithms figure out the most likely parse tree. Probabilistic Context-Free Grammars (PCFGs) are used when we want to find the most probable parsed structure of the sentence. PCFGs are grammar rules, similar to what you have seen before, along with probabilities associated with each production rule. An example production rule is as follows:

NP -> Det N (0.5) | N (0.3) | N PP (0.2)

It means that the probability of an NP breaking down to a ‘Det N’ is 0.50, to an ‘N’ is 0.30, and to an ‘N PP’ is 0.20. Note that the probabilities sum to 1.00. The overall probability of a parsed structure of a sentence is the product of the probabilities of all the rules used in that parse. The parse tree with the maximum probability is the best possible interpretation of the sentence.
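As a sketch, the probability of a parse tree is the product of the probabilities of the rules it uses. Only the NP probabilities below come from the example above; the S and VP rule probabilities are assumed for illustration:

```python
# Each rule maps (left-hand side, right-hand side) to its probability.
rule_prob = {("S", ("NP", "VP")): 1.0,     # assumed
             ("NP", ("Det", "N")): 0.5,
             ("NP", ("N",)): 0.3,
             ("NP", ("N", "PP")): 0.2,
             ("VP", ("V", "NP")): 0.6}     # assumed

def tree_probability(rules_used):
    """Multiply the probabilities of all rules used in a parse tree."""
    prob = 1.0
    for rule in rules_used:
        prob *= rule_prob[rule]
    return prob

# A parse using S -> NP VP, NP -> Det N, VP -> V NP, NP -> N:
# 1.0 * 0.5 * 0.6 * 0.3 = 0.09
tree_probability([("S", ("NP", "VP")), ("NP", ("Det", "N")),
                  ("VP", ("V", "NP")), ("NP", ("N",))])
```

When a sentence has several candidate parses, computing this product for each and keeping the maximum selects the most probable tree.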

Chomsky Normal Form

The Chomsky Normal Form (CNF), proposed by the linguist Noam Chomsky, is a normalised version of the CFG with a standard set of rules defining how a production rule must be written. A CNF rule can take one of three forms:

A -> B C
A -> a
S -> ε

Here A, B, and C are non-terminals (POS tags), a is a terminal (a word), S is the start symbol of the grammar, and ε is the null string. The table below shows examples of converting CFG rules to CNF:

| CFG | CNF |
| --- | --- |
| VP -> V NP PP | VP -> V NP1, NP1 -> NP PP |
| VP -> V | VP -> V VP1, VP1 -> ε |


Dependency Parsing

In dependency grammar, constituencies (such as NP, VP, etc.) do not form the basic elements of grammar; rather, dependencies are established between the words themselves.

Free-word-order languages such as Hindi and Bengali are difficult to parse using constituency parsing techniques, because in such languages the order of words/constituents may change significantly while the meaning stays exactly the same. It is thus difficult to fit the sentences into the finite set of production rules that CFGs offer. Dependencies in a sentence are defined using the elements subject, verb, and object (SVO). The following table shows SVO dependencies in three types of sentences: declarative, interrogative, and imperative.

Apart from dependencies defined in the form of subject-verb-object, there’s a non-exhaustive list of dependency relationships, which are called universal dependencies.

Dependencies are represented as labelled arcs of the form $h \to d \, (l)$, where $h$ is called the “head” of the dependency, $d$ is the “dependent”, and $l$ is the “label” assigned to the arc. In a dependency parse, we start from the root of the sentence, which is often a verb, and then establish dependencies between the root and the other words.
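As a sketch, a dependency parse can be represented as a set of labelled (head, dependent, label) arcs. The sentence and arcs below are hand-written for illustration (labels follow common Universal Dependencies names), not produced by a real parser:

```python
# "John booked the hotel." - the root verb is "booked".
arcs = [("booked", "John", "nsubj"),   # subject of the root verb
        ("booked", "hotel", "obj"),    # object of the root verb
        ("hotel", "the", "det")]       # determiner attaches to its noun

def dependents_of(head):
    """All (dependent, label) pairs attached to a given head word."""
    return [(d, l) for h, d, l in arcs if h == head]

dependents_of("booked")  # [('John', 'nsubj'), ('hotel', 'obj')]
```

Starting from the root and following arcs downwards recovers the whole sentence, which is what makes such a parse a tree.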


Information Extraction

An Information Extraction (IE) system can extract entities relevant for booking flights (such as source and destination cities, time, date, and budget constraints) in a structured format from unstructured user-generated input. IE is used in many applications, such as chatbots and extracting information from websites.

A generic pipeline for Information Extraction is as follows:

Most IE pipelines start with the usual text preprocessing steps: sentence segmentation, word tokenisation, and POS tagging. After preprocessing, the common tasks are Named Entity Recognition (NER) and, optionally, relation recognition and record linkage. NER is arguably the most important and non-trivial task in the pipeline, and there are various techniques and models for building a NER system.

The IOB (or BIO) method tags each token in the sentence with one of three labels: I - inside (the entity), O - outside (the entity), and B - beginning (of the entity). IOB labelling is especially helpful when entities contain multiple words; for example, ‘Delta Airlines’ and ‘New York’ are each single entities.
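As a sketch, IOB tags can be derived mechanically once the entity spans are known. The sentence and entity list below are illustrative assumptions:

```python
def iob_tags(tokens, entities):
    """Label tokens with B/I/O given known entities (lists of tokens)."""
    tags = ["O"] * len(tokens)
    for entity in entities:
        n = len(entity)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == entity:
                tags[i] = "B"                 # beginning of the entity
                for j in range(i + 1, i + n):
                    tags[j] = "I"             # inside the entity
    return tags

tokens = ["I", "flew", "Delta", "Airlines", "to", "New", "York"]
iob_tags(tokens, [["Delta", "Airlines"], ["New", "York"]])
# ['O', 'O', 'B', 'I', 'O', 'B', 'I']
```

Note how the multi-word entities each get a B tag on their first token and I tags on the rest, which lets a downstream system recover the exact span boundaries.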

Rule-based method for NER

Chunking is a common shallow parsing technique used to chunk words that constitute some meaningful phrase in the sentence. A noun phrase chunk (NP chunk) is commonly used in NER tasks to identify groups of words that correspond to some ‘entity’.

Sentence: She bought a new car from the BMW showroom.

Noun phrase chunks: a new car, the BMW showroom

The idea of chunking in the context of entity recognition is simple: most entities are nouns and noun phrases, so rules can be written to extract these noun phrases and, hopefully, a large number of named entities. Here is an example of chunking done using regular expressions:

Sentence: John booked the hotel.

Noun phrase chunks: ‘John’, ‘the hotel’

Grammar: NP_chunk: {<DT>?<NN>}
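A hand-rolled sketch of this chunk grammar, matching an optional determiner followed by a noun over the POS-tag sequence. (Libraries such as NLTK provide this properly via `RegexpParser`; the version below is purely illustrative.)

```python
import re

def np_chunks(tagged):
    """Extract NP chunks matching {<DT>?<NN>} from (word, tag) pairs."""
    # Encode the tag sequence as a string so a regex can scan it.
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for m in re.finditer(r"(<DT>)?<NN>", tag_string):
        # Map the character match back to token positions by counting '<'.
        start = tag_string[:m.start()].count("<")
        end = start + m.group().count("<")
        chunks.append(" ".join(word for word, _ in tagged[start:end]))
    return chunks

tagged = [("John", "NN"), ("booked", "VBD"), ("the", "DT"), ("hotel", "NN")]
np_chunks(tagged)  # ['John', 'the hotel']
```

This reproduces the example above: ‘John’ matches `<NN>` alone, and ‘the hotel’ matches `<DT><NN>`.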

Probabilistic method for NER

Two probabilistic models can be used to get the most probable IOB tags for words: a unigram chunker, which assigns the most likely IOB label to each POS tag, and a bigram chunker, which also conditions on the IOB label of the previous token.

Gazetteer lookup is another way to identify named entities (like cities and states): look them up in a dictionary or a gazetteer. A gazetteer is a geographical directory that stores data about the names of geographical entities (cities, states, countries) and some other features related to those geographies.
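As a minimal sketch, gazetteer lookup reduces to set membership over candidate phrases. The gazetteer below is a tiny hand-made set; a real one would list thousands of place names:

```python
# Hypothetical mini-gazetteer of location names (lowercased).
GAZETTEER = {"new york", "mumbai", "tokyo", "california"}

def tag_locations(tokens):
    """IOB-tag tokens whose (multi-word) phrase appears in the gazetteer."""
    tags = ["O"] * len(tokens)
    for n in (2, 1):                      # try longer names first
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in GAZETTEER and all(t == "O" for t in tags[i:i + n]):
                tags[i] = "B-location"
                for j in range(i + 1, i + n):
                    tags[j] = "I-location"
    return tags

tag_locations(["She", "moved", "from", "Mumbai", "to", "New", "York"])
# ['O', 'O', 'O', 'B-location', 'O', 'B-location', 'I-location']
```

Trying longer phrases first ensures that ‘New York’ is tagged as one two-word entity rather than leaving ‘York’ untagged.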

Naive Bayes and decision tree classifiers can also be used for NER.

Conditional Random Fields

HMMs can be used for any sequence classification task, such as NER. However, many NER tasks and datasets are far more complex than tasks such as POS tagging, and therefore, more sophisticated sequence models have been developed and widely accepted in the NLP community. One of these models is Conditional Random Fields (CRFs).

CRFs are used in a wide variety of sequence labelling tasks across various domains: POS tagging, speech recognition, NER, and even computational biology for modelling genetic patterns. CRFs model the conditional probability $P(Y|X)$, where $Y$ is the output sequence (IOB labels here) and $X$ is the input sequence (the words to be tagged), which makes them similar to a logistic regression classifier. Broadly, there are two types of classifiers in ML: discriminative classifiers, which model the conditional probability $P(Y|X)$ directly (e.g. logistic regression, CRFs), and generative classifiers, which model the joint probability $P(X, Y)$ (e.g. naive Bayes, HMMs).

CRFs use ‘feature functions’ rather than the input word sequence $x$ itself. The idea is similar to how features are extracted for building the naive Bayes and decision tree classifiers in a previous section. Some example word-level features (each word has these features) are the word itself, whether it starts with an uppercase letter, whether it is a number, and its POS tag.

A feature function takes the following four inputs: the input word sequence $x$, the position $i$ of the current word, the label $y_i$ of the current word, and the label $y_{i-1}$ of the previous word.

Let’s see an example of a feature function:

Consider a feature function $f_1$ that returns 1 if the word $x_i$ is a city name and the corresponding label $y_i$ is ‘I-location’, else 0. This can be represented as:

$f_1(x, i, y_i, y_{i-1}) = [[x_i \text{ is in the list of city names}] \text{ and } [y_i \text{ is I-location}]]$

The feature function returns 1 only if both the conditions are satisfied, i.e. when the word is a city name and is tagged as ‘I-location’ (e.g. Tokyo/I-location).
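A minimal sketch of this feature function in Python, assuming a tiny hand-made list of city names in place of a real gazetteer:

```python
# Hypothetical stand-in for a real list of city names.
CITY_NAMES = {"tokyo", "delhi", "paris"}

def f1(x, i, y_i, y_prev):
    """1 if word x[i] is a city name and its label is 'I-location', else 0."""
    return int(x[i].lower() in CITY_NAMES and y_i == "I-location")

x = ["She", "lives", "in", "Tokyo"]
f1(x, 3, "I-location", "O")  # 1: city name with the right label
f1(x, 3, "O", "O")           # 0: city name but wrong label
f1(x, 2, "I-location", "O")  # 0: right label but not a city name
```

Note that the function receives the previous label `y_prev` even though $f_1$ does not use it; other feature functions (e.g. label-transition features) do, so all feature functions share this signature.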

Every feature function $f_i$ has a weight $w_i$ associated with it, which represents the ‘importance’ of that feature function. This is almost exactly like logistic regression, where the coefficients of features represent their importance. Training a CRF means computing the optimal weight vector $w$ which best represents the observed sequences $y$ for the given word sequences $x$. In other words, we want to find the set of weights $w$ which maximises $P(y|x,w)$.

In CRFs, the conditional probabilities $P(y|x,w)$ are modelled using a scoring function. If there are $k$ feature functions (and thus $k$ weights), a scoring function for word $i$ in the sequence $x$ is defined as follows:

$score_i = \exp(w_1 f_1 + w_2 f_2 + \ldots + w_k f_k) = \exp(w \cdot f(y_i, x_i, y_{i-1}, i))$

and the overall sequence score for the sentence can be defined as:

$\text{sequence-score}(y|x) = \prod_{i=1}^{n} \exp(w \cdot f(y_i, x_i, y_{i-1}, i)) = \exp\Big(\sum_{i=1}^{n} w \cdot f(y_i, x_i, y_{i-1}, i)\Big)$

The probability of observing the label sequence $y$ given the input sequence $x$ is given by:

$P(y|x,w) = \exp\Big(\sum_{i=1}^{n} w \cdot f(y_i, x_i, y_{i-1}, i)\Big) \Big/ Z(x) = \exp(w \cdot f(x, y)) / Z(x)$

where $Z(x)$ is the sum of the scores of all $N$ possible tag sequences: $Z(x) = \sum_{1}^{N} \exp(w \cdot f(x, y))$.
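As a sketch, the sequence score, the normaliser $Z(x)$, and the resulting probability can all be computed by brute force for a toy model with a single feature function and a made-up weight:

```python
from itertools import product
from math import exp

LABELS = ["O", "I-location"]
CITY_NAMES = {"tokyo"}          # hypothetical mini-gazetteer
weights = [2.0]                 # w1, an illustrative made-up weight

def features(x, i, y_i, y_prev):
    # f1: word is a city name and is labelled 'I-location'
    return [int(x[i].lower() in CITY_NAMES and y_i == "I-location")]

def sequence_score(x, y):
    """exp of the summed weighted features over the whole sequence."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "start"
        total += sum(w * f for w, f in zip(weights, features(x, i, y[i], y_prev)))
    return exp(total)

def probability(x, y):
    """P(y|x, w) = score(y) / Z(x), Z summed over all label sequences."""
    z = sum(sequence_score(x, labels)
            for labels in product(LABELS, repeat=len(x)))
    return sequence_score(x, y) / z

probability(["in", "Tokyo"], ["O", "I-location"])
# = exp(2) / (2 + 2*exp(2)), roughly 0.44 with these toy numbers
```

Note that computing $Z(x)$ this way enumerates all $t^n$ label sequences; efficient CRF implementations compute it with dynamic programming instead.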

Training a CRF model means computing the optimal set of weights $w$ which best represents the observed sequences $y$ for the given word sequences $x$. In other words, we want to find the set of weights $w$ which maximises the conditional probability $P(y|x,w)$ over all the observed sequences $(x,y)$. By taking logs, simplifying the equations, and adding a regularisation term to prevent overfitting, the final objective comes out as:

$L(w) = \sum_{1}^{N} \big[(w \cdot f) - \log(Z)\big] - \text{regularisation term}$

The inference task is to assign to $x$ the label sequence $y^*$ which maximises the score of the sequence, i.e.

$y^* = \operatorname{argmax}_y \, (w \cdot f(x, y))$

The naive way to get $y^*$ is to calculate $w \cdot f(x, y)$ for every possible label sequence $y$ and then choose the label sequence with the maximum value. However, there are exponentially many possible label sequences ($t^n$ for a tag set of size $t$ and a sentence of length $n$), so this is computationally prohibitive.
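The naive enumeration can be sketched as follows; the scoring function here is a trivial stand-in for $w \cdot f(x, y)$, not a trained model:

```python
from itertools import product

LABELS = ["O", "I"]

def score(x, y):
    # Toy stand-in for w.f(x, y): reward 'I' labels on capitalised words.
    return sum(1.0 for word, label in zip(x, y)
               if label == "I" and word[0].isupper())

def brute_force_best(x):
    """Enumerate all t^n label sequences and keep the highest-scoring one."""
    candidates = product(LABELS, repeat=len(x))   # t^n sequences
    return max(candidates, key=lambda y: score(x, y))

brute_force_best(["in", "Tokyo"])  # ('O', 'I')
```

For a 20-word sentence with even two labels this already enumerates over a million sequences, which is why practical CRF inference uses a Viterbi-style dynamic programming algorithm instead.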