##### Information Retrival - Natural Launguage Processing (NLP)

Please leave a remark at the bottom of each page with your useful suggestion.

#### Introduction

Information retrieval (IR) is ﬁnding material (usually documents) of an unstructured nature (usually text) that satisﬁes an information need from within large collections (usually stored on computers).

#### Information retrieval

**Boolean retrieval****The term vocabulary and postings lists**- Document delineation and character sequence decoding
- Determining the vocabulary of terms
- Tokenization
- Dropping common terms: stop words
- Normalization (equivalence classing of terms)
- Stemming and lemmatization

- Faster postings list intersection via skip pointers
- Positional postings and phrase queries
- Biword indexes
- Positional indexes
- Combination schemes

**Dictionaries and tolerant retrieval**- Search structures for dictionaries
- Wildcard queries
- General wildcard queries
- k-gram indexes for wildcard queries

- Spelling correction
- Implementing spelling correction
- Forms of spelling correction
- Edit distance
- k-gram indexes for spelling correction
- Context sensitive spelling correction

- Phonetic correction

**Index construction**- Hardware basics
- Blocked sort-based indexing
- Single-pass in-memory indexing
- Distributed indexing
- Dynamic indexing
- Other types of indexes

**Index compression**- Statistical properties of terms in information retrieval
- Heaps’ law: Estimating the number of terms
- Zipf’s law: Modeling the distribution of terms

- Dictionary compression
- Dictionary as a string
- Blocked storage

- Postings ﬁle compression
- Variable byte codes
- γ codes

- Statistical properties of terms in information retrieval
**Scoring, term weighting and the vector space model**- Parametric and zone indexes
- Weighted zone scoring
- Learning weights
- The optimal weight g

- Term frequency and weighting
- nverse document frequency
- f-idf weighting

- The vector space model for scoring
- Dot products
- Queries as vectors
- Computing vector scores

- Variant tf-idf functions
- Sublinear tf scaling
- Maximum tf normalization
- Document and query weighting schemes
- Pivoted normalized document length

- Parametric and zone indexes
**Computing scores in a complete search system**- Efﬁcient scoring and ranking
- Inexact top K document retrieval
- Index elimination
- Champion lists
- Static quality scores and ordering
- Impact ordering
- Cluster pruning

- Components of an information retrieval system
- Tiered indexes
- Query-term proximity
- Designing parsing and scoring functions
- Putting it all together

- Vector space scoring and query operator interaction

- Efﬁcient scoring and ranking
**Evaluation in information retrieval**- Information retrieval system evaluation
- Standard test collections
- Evaluation of unranked retrieval sets
- Evaluation of ranked retrieval results
- Assessing relevance
- Critiques and justiﬁcations of the concept of relevance

- A broader perspective: System quality and user utility
- System issues
- User utility
- Reﬁning a deployed system

- Results snippets

**Relevance feedback and query expansion**- Relevance feedback and pseudo relevance feedback
- The Rocchio algorithm for relevance feedback
- Probabilistic relevance feedback
- When does relevance feedback work?
- Relevance feedback on the web
- Evaluation of relevance feedback strategies
- Pseudo relevance feedback
- Indirect relevance feedback

- Global methods for query reformulation
- Vocabulary tools for query reformulation
- Query expansion
- Automatic thesaurus generation

- Relevance feedback and pseudo relevance feedback
**XML retrieval**- Basic XML concepts
- Challenges in XML retrieval
- A vector space model for XML retrieval
- Evaluation of XML retrieval
- Text-centric vs. data-centric XML retrieval

**Probabilistic information retrieval**- Review of basic probability theory
- The Probability Ranking Principle
- The 1/0 loss case
- The PRP with retrieval costs

- The Binary Independence Model
- Deriving a ranking function for query terms
- Probability estimates in theory
- Probability estimates in practice
- Probabilistic approaches to relevance feedback

- An appraisal and some extensions
- An appraisal of probabilistic models
- Tree-structured dependencies between terms
- Okapi BM25: a non-binary model
- Bayesian network approaches to IR

**Language models for information retrieval**- Language models
- Finite automata and language models
- Types of language models
- Multinomial distributions over words

- The query likelihood model
- Using query likelihood language models in IR
- Estimating the query generation probability
- Ponte and Croft’s Experiments

- Language modeling versus other approaches in IR
- Extended language modeling approaches

- Language models
**Text classiﬁcation and Naive Bayes**- The text classiﬁcation problem
- Naive Bayes text classiﬁcation
- Relation to multinomial unigram language model

- The Bernoulli model
- Properties of Naive Bayes
- A variant of the multinomial model

- Feature selection
- Mutual information
- χ2 Feature selection
- Frequency-based feature selection
- Feature selection for multiple classiﬁers
- Comparison of feature selection methods

- Evaluation of text classiﬁcation

**Vector space classiﬁcation**- Document representations and measures of relatedness in vector spaces
- Rocchio classiﬁcation
- k nearest neighbor
- Time complexity and optimality of kNN

- Linear versus nonlinear classiﬁers
- Classiﬁcation with more than two classes
- The bias-variance tradeoff

**Support vector machines and machine learning on documents**- Support vector machines: The linearly separable case
- Extensions to the SVM model
- Soft margin classiﬁcation
- Multiclass SVMs
- Nonlinear SVMs
- Experimental results

- Issues in the classiﬁcation of text documents
- Choosing what kind of classiﬁer to use
- Improving classiﬁer performance

- Machine learning methods in ad hoc information retrieval
- A simple example of machine-learned scoring
- Result ranking by machine learning

**Flat clustering**- Clustering in information retrieval
- Problem statement
- Cardinality – the number of clusters

- Evaluation of clustering
- K-means
- Cluster cardinality in K-means

- Model-based clustering

**Hierarchical clustering**- Hierarchical agglomerative clustering
- Single-link and complete-link clustering
- Time complexity of HAC

- Group-average agglomerative clustering
- Centroid clustering
- Optimality of HAC
- Divisive clustering
- Cluster labeling
- Implementation notes

**Matrix decompositions and latent semantic indexing**- Linear algebra review
- Matrix decompositions

- Term-document matrices and singular value decompositions
- Low-rank approximations
- Latent semantic indexing

- Linear algebra review
**Web search basics**- Background and history
- Web characteristics
- The web graph
- Spam

- Advertising as the economic model
- The search user experience
- User query needs

- Index size and estimation
- Near-duplicates and shingling

**Web crawling and indexes**- Crawling
- Crawler architecture
- DNS resolution
- The URL frontier

- Distributing indexes
- Connectivity servers

- Crawling
**Link analysis**- The Web as a graph
- Anchor text and the web graph

- PageRank
- Markov chains
- The PageRank computation
- Topic-speciﬁc PageRank

- Hubs and Authorities
- Choosing the subset of the Web

- The Web as a graph

#### Boolean retrieval

**The term vocabulary and postings lists**- Document delineation and character sequence decoding
- Determining the vocabulary of terms
- Tokenization
- Dropping common terms: stop words
- Normalization (equivalence classing of terms)
- Stemming and lemmatization

- Faster postings list intersection via skip pointers
- Positional postings and phrase queries
- Biword indexes
- Positional indexes
- Combination schemes

#### Dictionaries and tolerant retrieval

- Search structures for dictionaries
- Wildcard queries
- General wildcard queries
- k-gram indexes for wildcard queries

- Spelling correction
- Implementing spelling correction
- Forms of spelling correction
- Edit distance
- k-gram indexes for spelling correction
- Context sensitive spelling correction

- Phonetic correction

#### Index construction

- Hardware basics
- Blocked sort-based indexing
- Single-pass in-memory indexing
- Distributed indexing
- Dynamic indexing
- Other types of indexes

#### Index compression

- Statistical properties of terms in information retrieval
- Heaps’ law: Estimating the number of terms
- Zipf’s law: Modeling the distribution of terms

- Dictionary compression
- Dictionary as a string
- Blocked storage

- Postings ﬁle compression
- Variable byte codes
- γ codes

#### Scoring, term weighting and the vector space model

- Parametric and zone indexes
- Weighted zone scoring
- Learning weights
- The optimal weight g

- Term frequency and weighting
- nverse document frequency
- f-idf weighting

- The vector space model for scoring
- Dot products
- Queries as vectors
- Computing vector scores

- Variant tf-idf functions
- Sublinear tf scaling
- Maximum tf normalization
- Document and query weighting schemes
- Pivoted normalized document length

#### Computing scores in a complete search system

- Efﬁcient scoring and ranking
- Inexact top K document retrieval
- Index elimination
- Champion lists
- Static quality scores and ordering
- Impact ordering
- Cluster pruning

- Components of an information retrieval system
- Tiered indexes
- Query-term proximity
- Designing parsing and scoring functions
- Putting it all together

- Vector space scoring and query operator interaction

#### Evaluation in information retrieval

- Information retrieval system evaluation
- Standard test collections
- Evaluation of unranked retrieval sets
- Evaluation of ranked retrieval results
- Assessing relevance
- Critiques and justiﬁcations of the concept of relevance

- A broader perspective: System quality and user utility
- System issues
- User utility
- Reﬁning a deployed system

- Results snippets

#### Relevance feedback and query expansion

- Relevance feedback and pseudo relevance feedback
- The Rocchio algorithm for relevance feedback
- Probabilistic relevance feedback
- When does relevance feedback work?
- Relevance feedback on the web
- Evaluation of relevance feedback strategies
- Pseudo relevance feedback
- Indirect relevance feedback

- Global methods for query reformulation
- Vocabulary tools for query reformulation
- Query expansion
- Automatic thesaurus generation

#### XML retrieval

- Basic XML concepts
- Challenges in XML retrieval
- A vector space model for XML retrieval
- Evaluation of XML retrieval
- Text-centric vs. data-centric XML retrieval

#### Probabilistic information retrieval

- Review of basic probability theory
- The Probability Ranking Principle
- The 1/0 loss case
- The PRP with retrieval costs

- The Binary Independence Model
- Deriving a ranking function for query terms
- Probability estimates in theory
- Probability estimates in practice
- Probabilistic approaches to relevance feedback

- An appraisal and some extensions
- An appraisal of probabilistic models
- Tree-structured dependencies between terms
- Okapi BM25: a non-binary model
- Bayesian network approaches to IR

#### Language models for information retrieval

- Language models
- Finite automata and language models
- Types of language models
- Multinomial distributions over words

- The query likelihood model
- Using query likelihood language models in IR
- Estimating the query generation probability
- Ponte and Croft’s Experiments

- Language modeling versus other approaches in IR
- Extended language modeling approaches

#### Text classiﬁcation and Naive Bayes

- The text classiﬁcation problem
- Naive Bayes text classiﬁcation
- Relation to multinomial unigram language model

- The Bernoulli model
- Properties of Naive Bayes
- A variant of the multinomial model

- Feature selection
- Mutual information
- χ2 Feature selection
- Frequency-based feature selection
- Feature selection for multiple classiﬁers
- Comparison of feature selection methods

- Evaluation of text classiﬁcation

#### Vector space classiﬁcation

- Document representations and measures of relatedness in vector spaces
- Rocchio classiﬁcation
- k nearest neighbor
- Time complexity and optimality of kNN

- Linear versus nonlinear classiﬁers
- Classiﬁcation with more than two classes
- The bias-variance tradeoff

#### Support vector machines and machine learning on documents

- Support vector machines: The linearly separable case
- Extensions to the SVM model
- Soft margin classiﬁcation
- Multiclass SVMs
- Nonlinear SVMs
- Experimental results

- Issues in the classiﬁcation of text documents
- Choosing what kind of classiﬁer to use
- Improving classiﬁer performance

- Machine learning methods in ad hoc information retrieval
- A simple example of machine-learned scoring
- Result ranking by machine learning

#### Flat clustering

- Clustering in information retrieval
- Problem statement
- Cardinality – the number of clusters

- Evaluation of clustering
- K-means
- Cluster cardinality in K-means

- Model-based clustering

#### Hierarchical clustering

- Hierarchical agglomerative clustering
- Single-link and complete-link clustering
- Time complexity of HAC

- Group-average agglomerative clustering
- Centroid clustering
- Optimality of HAC
- Divisive clustering
- Cluster labeling
- Implementation notes

#### Matrix decompositions and latent semantic indexing

- Linear algebra review
- Matrix decompositions

- Term-document matrices and singular value decompositions
- Low-rank approximations
- Latent semantic indexing

#### Web search basics

- Background and history
- Web characteristics
- The web graph
- Spam

- Advertising as the economic model
- The search user experience
- User query needs

- Index size and estimation
- Near-duplicates and shingling

#### Web crawling and indexes

- Crawling
- Crawler architecture
- DNS resolution
- The URL frontier

- Distributing indexes
- Connectivity servers

#### Link analysis

- The Web as a graph
- Anchor text and the web graph

- PageRank
- Markov chains
- The PageRank computation
- Topic-speciﬁc PageRank

- Hubs and Authorities
- Choosing the subset of the Web