Javier's Blog

Mostly computers and other tech stuff,...

Friday, February 25, 2011

Just a Quick Paper on Computational Linguistics


Communication with computer systems is becoming increasingly natural. Although the technology depicted in current science fiction novels and films might seem farfetched, computer interaction via speech is becoming a reality. Many specialized systems have been developed for use by the disabled, which allows for interaction with a computer system via speech and other means. There are countless speech to text and text to speech applications which can be deployed on ordinary computers. Even most modern cellular phones have some type of speech enabled command capability. It is true that this technology is at its infancy and has a long way to go before we can seamlessly communicate with computer systems via speech. But recent advances in this technology, known as Computational Linguistics, seem to grasp at what someday will no longer be science fiction.

Computational Linguistics, a subdivision of the broader subject known as Natural Language Processing (NLP), deals with language based human-computer interaction, as well as computer aided language translation. Computational Linguistics is used in many applications, speech recognition, language translation, spelling and grammar checking, etc. The primary techniques now used in Computational Linguistics are statistical in nature and have brought significant advances to the field in recent years [1]. A system which is to seamlessly communicate via a natural language must have the following abilities: speech recognition, natural language understanding, natural language generation and speech synthesis, [3] all of which fall under Computational Linguistics. Other abilities such a system must have are information retrieval and extraction as well as inference; these are however, a bit out of the realm of Computational Linguistics. Given the complexity of natural languages, the task that computational linguists have taken up is more difficult than once thought.


There is some debate about the approach that computational linguists have taken. Rational linguists, lead by Noam Chomsky, believe that statistical analysis has little to no chance at encompassing language entirely. Part of the bias against statistical analysis comes from the fact that early statistical NLP systems were extremely simple and could not begin to process the complexity of language. Another argument against statistical analysis is that computing the probability of sentences from a body of text would assign the same probability to grammatical and ungrammatical sentences [2]. Furthermore, there are sentences which are grammatically correct but are nonsensical in nature. Chomsky’s famous example of this is “Colorless green ideas sleep furiously,” a sentence which is grammatically correct but does not make sense. The argument here is that a system which is to communicate with humans must be able to decipher such a sentence as erroneous.

Computational linguists handle this issue by not worrying about which sentences are grammatically correct and which are not, instead, they make note of sentences which are likely to be said. Correct sentences are more likely to be said while incorrect sentences are less likely to be said. Often used sentences, regardless of whether they are considered correct or not, are considered part of the language as they convey some mutually agreed meaning. The earlier issue about the complexity of Computational Linguistic systems is no longer an issue since modern Computational Linguistic systems are nearly as complex as the models developed by rational linguists. The difference being that the former takes a statistical approach to learning and does not try to represent every part of the brain as we understand it.


The main challenge computational linguists have taken up is the disambiguation of language. In the current state of affairs, computer systems learn from bodies of text, known as corpora [2], as they cannot observe the natural word around us and infer information as we do. The problem with this approach is that sentences often may be parsed in more ways than one. Sentences, for example, may be parsed so that their verb groups contain one word for one meaning, and multiple words for another meaning. Under these situations multiple parse trees may be generated, each parse tree having a slightly different meaning than similar trees. For long sentences, the number of applicable parse trees may be enormous [2].

Computational Linguists argue that by using a statistical NLP approach, where lexical and structural preferences in language are computed, the issue of numerous permutations of parse trees becomes moot. This approach aims at approximating the appropriate representation for a parse tree which conveys the meaning of the sentence. Lexical and structural preferences are learned or remembered by Computational Linguistic systems via N-grams.


N-grams are collections of words as they would appear in a natural language i.e. a sentence fraction where ‘n’ is the number of words in that fraction. For example, given the sentence “the fox jumped over the dog,” its corresponding trigram (3-gram) would consist of “the fox jumped”, “fox jumped over”, “jumped over the”, “over the dog.” These n-grams are built by parsing corpora and partitioned by having the previous (n - 1) words in common. A state sequence machine or automaton can be built from these n-gram partitions to identify the probabilities of one word or state to follow another. Using the statistical qualities of n-grams, the next word (n) may be predicted in a sentence fragment with relative accuracy. A variation of this model, where the state sequence is originally unknown, is referred to as the Hidden Markov Model (HMM). HMMs are the foundation to modern speech recognition systems [2] as well as other NLP applications.


Augmented Transition Networks (ATNs) build on the idea of using finite state machines in order to grammatically parse sentences. W. A. Wood in “Transition Network Grammars for Natural Language Analysis” claims that by adding a recursive mechanism to a finite state model, parsing can be achieved much more efficiently. Instead of building an automaton for a particular sentence, a collection of transition graphs are built. A grammatically correct sentence is parsed by reaching a final state in any state graph. Transitions between these graphs are simply subroutine calls from one state to any initial state on any graph in the network. A sentence is determined to be grammatically correct if a final state is reached by the last word in the sentence.

This model meets many of the goals set forth by the nature of language in that it captures the regularities of the language. That is, if there is a process that operates in a number of environments, the grammar should encapsulate the process in a single structure [4]. Such encapsulation not only simplifies the grammar, but has the added bonus of efficiency of operation. Another advantage of such a model is the ability to postpone decisions. Many grammars use guessing when an ambiguity comes up. This means that not enough is yet known about the sentence. By the use of recursion, ATNs solve this inefficiency by postponing decisions until more is known about a sentence [4].


As it can be clearly seen by this brief overview of Computational Linguistics, the field is complex and evolving. Although Computational Linguistics is at its infancy, much of the ground work has already been laid out. It is hard to determine whether a system with similar abilities to those of 3CPO (Star Wars) will ever become a reality, but it is clear that a subset of those abilities are already helping us with our everyday lives. Applications such as speech recognition and text translation are not perfect by any means, but are simply a subset of the capabilities future systems will have at their disposal.


[1] Steven Abney. Statistical Methods and Linguistics. In: Judith Klavans and Philip Resnik (eds.), The Balancing Act: Combining Symbolic and Statistical Approaches to Language. The MIT Press, Cambridge, MA. 1996.

[2] Christopher D. Manning and Hinrich Schütze, 1999, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA.

[3] Daniel Jurafsky and James H. Martin, 2000, Speech and Language Processing, Prentice Hall, Upper Saddle River, New Jersey.

[4] Transition Network Grammars for Natural Language Analysis, W. A. Woods, Communications of the ACM, Volume 13 , Issue 10 (October 1970) Pages: 591 - 606, ISSN:0001-0782