Natural Language Processing (NLP)

Introduction

Natural language: An ordinary language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation.

•NLP: A combination of linguistics, computer science, and artificial intelligence that focuses on the interaction between computers and human language.

•NLP deals with processing and analyzing large amounts of natural language data.


History

1950 / Alan Turing: Published the article “Computing Machinery and Intelligence”, which proposed what is now called the Turing test as a criterion of intelligence.

1964 / ELIZA, Joseph Weizenbaum: A simulation of a Rogerian psychotherapist that rephrased the user's statements using a few grammar rules.

1970 / SHRDLU, Terry Winograd: A natural language system working in a restricted “blocks world” with a restricted vocabulary; it worked extremely well.

1982 / Jabberwacky, Rollo Carpenter: A chatterbot with the stated aim to “simulate natural human chat in an interesting, entertaining and humorous manner”.

1990 / Dr. Sbaitso, Creative Labs (Singaporean company): An AI speech synthesis program for MS-DOS-based personal computers.

2006 / Watson, IBM: AI-based software.

2011 / Siri, Apple: A virtual assistant.

2014 / Amazon Alexa, Amazon: A virtual assistant.

2016 / Google Assistant, Google: A virtual assistant.

NLP Components

Phonetics and phonology: The study of language sounds

Ecology: The study of language conventions for punctuation, text mark-up and encoding

Morphology: The study of meaningful components of words

Syntax: The study of structural relationships among words

Lexical semantics: The study of word meaning

Compositional semantics: The study of the meaning of sentences

Pragmatics: The study of the use of language to accomplish goals

Discourse conventions: The study of conventions of dialogue

What is Text?

•A set of characters that belong to a particular language and carry a specific meaning.

•Textual information is available in many forms and languages, and it has to be processed before it is fed to a machine.

Text processing

Three processing techniques are widely used for text analysis: lexical processing, syntactic processing and semantic processing.


Lexical Processing

When we plot the word frequencies of any text document that contains enough words, we see that they follow a Zipf distribution: the frequency of a word is roughly inversely proportional to its frequency rank.

A text corpus mostly contains three types of words:

1. Stop words: such as is, an, the, etc.

2. Significant words: These words help us with the real text analysis.

3. Rarely occurring words

Stop words are not very useful for many applications, so we usually remove them: they take up a lot of memory and can decrease model performance.
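Stop-word removal can be sketched as follows. This is a minimal illustration, assuming a tiny hand-written stop list; STOP_WORDS and remove_stop_words are hypothetical names, and real applications usually take a much longer list from a library.

```python
# Hypothetical, hand-picked stop list for illustration only.
STOP_WORDS = {"is", "an", "the", "a", "of", "to", "are", "and"}


def remove_stop_words(text):
    """Lowercase the text, split it on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


print(remove_stop_words("The movie is an example of a hit"))
# ['movie', 'example', 'hit']
```

Only the significant words remain, which shrinks the vocabulary that later steps have to store.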


Tokenization

•Tokenization is the process of breaking a text corpus into words, sentences or paragraphs.

•How the text is broken up depends on the requirements of the application.
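Word-level and sentence-level tokenization can be sketched with regular expressions. This is a naive illustration under simple assumptions (English text, sentences ending in ".", "!" or "?"); the function names are hypothetical, and abbreviations such as "Dr." would be split incorrectly.

```python
import re


def word_tokenize(text):
    """Return word tokens: runs of letters, digits or apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def sentence_tokenize(text):
    """Split text at whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = "NLP is fun. It deals with human language!"
print(sentence_tokenize(sample))
# ['NLP is fun.', 'It deals with human language!']
print(word_tokenize(sample))
# ['NLP', 'is', 'fun', 'It', 'deals', 'with', 'human', 'language']
```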


After removal of the stop words, we need to take care of redundant information as well.


Text Representation

Bag of Words

•Also called Bag-of-Words model

•Each row of the table represents one document.

•Columns represent the vocabulary of the text

For example :

Doc1: “Dangal is a super duper hit movie”

Doc2: “The success of movie depends upon the performance of the actors”

Doc3: “No movies are releasing due to Covid”

•In the above model, after removing all the stop words, the value in each cell is the number of times a term ‘t’ is present in the document ‘d’, which is the term frequency.

•The term ‘movie’ is present in all the documents (once ‘movies’ is reduced to ‘movie’), while ‘actor’ is present only in the second document.
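The table for the three example documents can be built with a short sketch like the one below. The stop list and the plural-stripping rule in normalize are assumptions made for illustration (a crude stand-in for real stemming or lemmatization), and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical stop list, just large enough for the three example documents.
STOP_WORDS = {"is", "a", "the", "of", "are", "to", "no"}

docs = {
    "Doc1": "Dangal is a super duper hit movie",
    "Doc2": "The success of movie depends upon the performance of the actors",
    "Doc3": "No movies are releasing due to Covid",
}


def normalize(word):
    """Lowercase a word and crudely strip a plural 's' (not real stemming)."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word


def term_counts(text):
    """Count every term of one document that is not a stop word."""
    terms = (normalize(w) for w in text.split())
    return Counter(t for t in terms if t not in STOP_WORDS)


counts = {name: term_counts(text) for name, text in docs.items()}
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary term.
bow_table = {name: [counts[name][term] for term in vocabulary] for name in docs}

for name, row in bow_table.items():
    print(name, dict(zip(vocabulary, row)))
```

In the resulting table the column for ‘movie’ is 1 in every row, while the column for ‘actor’ is 1 only for Doc2.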


TF-IDF

Instead of focusing on the raw word frequencies in the tables created for bag-of-words models, we can use a representation that focuses more on word importance.


TF Value Calculation

Review 2: This movie is not scary and is slow

Here,

•Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’

•Number of words in Review 2 = 8

•TF for the word ‘this’ = (number of times ‘this’ appears in review 2)/(number of terms in review 2) = 1/8
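The TF calculation for Review 2 can be reproduced with a few lines of code. This is a minimal sketch that assumes whitespace tokenization and lowercasing; term_frequencies is a hypothetical helper name.

```python
from collections import Counter


def term_frequencies(text):
    """Return TF per term: occurrences in the document / total terms in it."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


review_2 = "This movie is not scary and is slow"
tf = term_frequencies(review_2)
print(tf["this"])  # 1/8 = 0.125
print(tf["is"])    # 2/8 = 0.25
```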


IDF calculation

We can calculate the IDF values for all the words in Review 2. For example: IDF(‘this’) = log(number of documents/number of documents containing the word ‘this’) = log(3/3) = log(1) = 0

TF-IDF Calculation
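The TF-IDF score of a term is its TF value multiplied by its IDF value. The sketch below shows this for Review 2. Reviews 1 and 3 do not appear in the text above, so the two used here are hypothetical stand-ins chosen only to match the listed vocabulary; the base-10 logarithm is also an assumption, since the base is not stated.

```python
import math
from collections import Counter

reviews = [
    "This movie is very scary and long",    # hypothetical Review 1
    "This movie is not scary and is slow",  # Review 2, as given in the text
    "This movie is spooky and good",        # hypothetical Review 3
]
tokenized = [review.lower().split() for review in reviews]


def tf(term, tokens):
    """Occurrences of the term in one document / total terms in it."""
    return Counter(tokens)[term] / len(tokens)


def idf(term, documents):
    """log10(number of documents / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for tokens in documents if term in tokens)
    return math.log10(len(documents) / containing)


def tf_idf(term, tokens, documents):
    """TF-IDF is the product of the two values."""
    return tf(term, tokens) * idf(term, documents)


review_2 = tokenized[1]
for term in ("this", "scary", "slow"):
    print(term, round(tf_idf(term, review_2, tokenized), 3))
```

‘this’ scores 0 because it occurs in every document, while ‘slow’, which occurs only in Review 2, gets the highest score of the three terms.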

Applications of TF-IDF

Spam Email Detector

Sentiment Classification


Syntactic Processing

Focuses more on grammar and syntax.

Widely used in applications such as:

•Question answering systems

•Information extraction

•Sentiment analysis

•Grammar checking

Part-of-speech (PoS) tagging is an important task that is used as a preprocessing step in many applications.


Semantic Processing


Focuses more on the meaning of a given piece of text.

Text Database

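WordNet and ConceptNet: WordNet is a semantically oriented dictionary of English with a richer structure than a traditional thesaurus; ConceptNet is a semantic network of relations between words and concepts.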

Word Sense Disambiguation

The WSD task is to identify the correct sense of an ambiguous word such as bank, bark, pitch, etc.

Lesk Algorithm
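The Lesk algorithm picks the sense whose dictionary definition (gloss) shares the most words with the context of the ambiguous word. The sketch below is a simplified version of that idea, not a full implementation: the two glosses for ‘bank’ are hand-written, hypothetical examples rather than real WordNet entries, and the stop list is an assumption.

```python
# Hypothetical stop list so that function words do not count as overlap.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "to", "and", "or",
              "that", "i", "my", "at", "is"}

# Hypothetical mini sense inventory; a real system would read glosses
# from a resource such as WordNet.
SENSES = {
    "bank": {
        "financial institution":
            "an institution that accepts deposits of money and lends money",
        "river side":
            "sloping land beside a river or other body of water",
    }
}


def content_words(text):
    """Return the set of lowercased words that are not stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def simplified_lesk(word, sentence):
    """Return the sense whose gloss overlaps most with the sentence.

    On a tie, the sense listed first in SENSES wins.
    """
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bank", "I went to the bank to deposit my money"))
# financial institution
print(simplified_lesk("bank", "We sat on the bank of the river"))
# river side
```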


NLP is used in many applications:

Social media analysis: Twitter sentiment analysis, topic modeling, etc.

Chatbots

Information extraction, etc.