Unit 1
Introduction – History of NLP, study of human languages, ambiguity, phases of natural language processing, applications. Textual sources and formats. Linguistic resources – Introduction to the corpus, elements of a balanced corpus (examples: TreeBank, PropBank, WordNet, VerbNet, etc.). Word-level analysis – Regular expressions, morphological parsing, types of morphemes, tokenization, N-grams, stemming, lemmatization, spell checking. Management of linguistic data with NLTK.
Unit 2
Syntactic Analysis – Lexemes, phonemes, phrases and idioms, word order, agreement, tense, aspect, and mood, Context-Free Grammar, and spoken language syntax. Parsing – Unification, probabilistic parsing. Part-of-Speech tagging – Rule-based POS tagging, stochastic POS tagging, transformation-based tagging (TBL), handling of unknown words, named entities, and multi-word expressions.
Semantic Analysis – Meaning representation, semantic analysis, lexical semantics, WordNet (similarity measures, synsets, and hypernyms). Word Sense Disambiguation – Selectional restrictions, machine learning approaches, dictionary-based approaches.
Unit 3
Discourse – Reference resolution, constraints on co-reference, an algorithm for pronoun resolution, text coherence, discourse structure. Information Retrieval – Types of information retrieval models, Boolean model, vector space model (Word2Vec, BERT), improving user queries. Machine Translation – EM algorithm, discriminative learning, deep representation learning, generative learning.
Applications of NLP – Machine translation, document summarization, sentiment analysis, ChatGPT (GPT-4).