Natural Language Processing

There is very little formal logic in any language that humans speak. Most of our languages have evolved over time, based on convenience and random acceptance. (Sanskrit is often cited as an exception, because its grammar was formally codified.) It is extremely hard to write an explicit algorithm that maps an English sentence to its meaning. Yet the mind has absolutely no problem doing this job! How does it manage this? NLP is the science of trying to understand the mind's way of understanding a language and producing text.

But there are many fundamental problems in doing this. Let us get a feel for some of them, so that we can appreciate the value of the solutions that researchers have identified.

One of the fundamental obstacles in processing languages is that there is no intuitive way to measure words. All our algorithms work with numbers. Most machine learning algorithms rest on the assumption that two data points with similar values are similar to each other. Now how do we fit languages and words into this?

Well, one might just use the ASCII codes of the letters to get numbers out of words. That does give numbers. But, if we look into it, are two words with similar numbers similar to each other? For example, do "bad" and "dad" mean something similar? Worse, a single spelling can carry unrelated meanings: "bat" the animal and "bat" the sports equipment get exactly the same numbers, yet they are certainly not similar to each other. One could still force this into a model, but a model that fits such crazy data is bound to overfit the training set. Thus, before we can even begin any kind of training, the first job is to assign meaningful numbers to our words, so that we can create a numerical model for them.
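To make the ASCII idea concrete, here is a minimal sketch in Python. The helper names `ascii_codes` and `code_distance` are made up for this illustration, and the "distance" used (a sum of absolute differences between character codes) is just one simple assumption about how one might compare the numbers. It shows that closeness in code values tells us nothing about closeness in meaning.

```python
def ascii_codes(word):
    """Return the ASCII code of each character in the word."""
    return [ord(ch) for ch in word]


def code_distance(a, b):
    """Sum of absolute differences between character codes.

    Hypothetical helper for illustration; only defined for
    words of equal length.
    """
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(abs(x - y) for x, y in zip(ascii_codes(a), ascii_codes(b)))


print(ascii_codes("bad"))           # [98, 97, 100]
print(ascii_codes("dad"))           # [100, 97, 100]

# Unrelated meanings, yet almost identical numbers.
print(code_distance("bad", "dad"))  # 2

# Near-synonyms, yet the numbers are far apart.
print(code_distance("dad", "pop"))  # 38

# Two different meanings of "bat" are indistinguishable.
print(code_distance("bat", "bat"))  # 0
```

By this measure "bad" sits right next to "dad", while "dad" and "pop" are far apart, which is the opposite of what a useful word representation should give us.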

Another problem is that speech can never be understood from a point-in-time sample. Anything we say takes its meaning from what we said some time back. Any word in a sentence has its meaning in relation to another word that showed up some time before, or perhaps to a word that comes up further down the line. And there is no limit to this gap. A word by itself does not carry much meaning in any language. Its meaning depends on the sentence, on the sentences before it, on the events that occurred before, and on events that happened years ago... a chain that could be traced back to the big bang itself! The meaning of anything we say today can be entirely different in the context of an event in the past.

And that is just the text. When we speak, we have a lot of information stuffed into our tone, volume, tempo, etc. The same sentence can have a dramatically different meaning when we stress it differently. The often-quoted example is the sentence "I don't think he should get the job". It means something different depending on which word we stress. How do we create a model out of this?

And that's not all. How about sarcasm, puns, humor...? Well, things are not as easy as we thought!