Information Retrival - Natural Launguage Processing (NLP)

Please leave a remark at the bottom of each page with your useful suggestion.


Introduction

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).


Information retrieval

  • Boolean retrieval
  • The term vocabulary and postings lists
    • Document delineation and character sequence decoding
    • Determining the vocabulary of terms
      • Tokenization
      • Dropping common terms: stop words
      • Normalization (equivalence classing of terms)
      • Stemming and lemmatization
    • Faster postings list intersection via skip pointers
    • Positional postings and phrase queries
      • Biword indexes
      • Positional indexes
      • Combination schemes
  • Dictionaries and tolerant retrieval
    • Search structures for dictionaries
    • Wildcard queries
      • General wildcard queries
      • k-gram indexes for wildcard queries
    • Spelling correction
      • Implementing spelling correction
      • Forms of spelling correction
      • Edit distance
      • k-gram indexes for spelling correction
      • Context sensitive spelling correction
    • Phonetic correction
  • Index construction
    • Hardware basics
    • Blocked sort-based indexing
    • Single-pass in-memory indexing
    • Distributed indexing
    • Dynamic indexing
    • Other types of indexes
  • Index compression
    • Statistical properties of terms in information retrieval
      • Heaps’ law: Estimating the number of terms
      • Zipf’s law: Modeling the distribution of terms
    • Dictionary compression
      • Dictionary as a string
      • Blocked storage
    • Postings file compression
      • Variable byte codes
      • γ codes
  • Scoring, term weighting and the vector space model
    • Parametric and zone indexes
      • Weighted zone scoring
      • Learning weights
      • The optimal weight g
    • Term frequency and weighting
      • nverse document frequency
      • f-idf weighting
    • The vector space model for scoring
      • Dot products
      • Queries as vectors
      • Computing vector scores
    • Variant tf-idf functions
      • Sublinear tf scaling
      • Maximum tf normalization
      • Document and query weighting schemes
      • Pivoted normalized document length
  • Computing scores in a complete search system
    • Efficient scoring and ranking
      • Inexact top K document retrieval
      • Index elimination
      • Champion lists
      • Static quality scores and ordering
      • Impact ordering
      • Cluster pruning
    • Components of an information retrieval system
      • Tiered indexes
      • Query-term proximity
      • Designing parsing and scoring functions
      • Putting it all together
    • Vector space scoring and query operator interaction
  • Evaluation in information retrieval
    • Information retrieval system evaluation
    • Standard test collections
    • Evaluation of unranked retrieval sets
    • Evaluation of ranked retrieval results
    • Assessing relevance
      • Critiques and justifications of the concept of relevance
    • A broader perspective: System quality and user utility
      • System issues
      • User utility
      • Refining a deployed system
    • Results snippets
  • Relevance feedback and query expansion
    • Relevance feedback and pseudo relevance feedback
      • The Rocchio algorithm for relevance feedback
      • Probabilistic relevance feedback
      • When does relevance feedback work?
      • Relevance feedback on the web
      • Evaluation of relevance feedback strategies
      • Pseudo relevance feedback
      • Indirect relevance feedback
    • Global methods for query reformulation
      • Vocabulary tools for query reformulation
      • Query expansion
      • Automatic thesaurus generation
  • XML retrieval
    • Basic XML concepts
    • Challenges in XML retrieval
    • A vector space model for XML retrieval
    • Evaluation of XML retrieval
    • Text-centric vs. data-centric XML retrieval
  • Probabilistic information retrieval
    • Review of basic probability theory
    • The Probability Ranking Principle
      • The 1/0 loss case
      • The PRP with retrieval costs
    • The Binary Independence Model
      • Deriving a ranking function for query terms
      • Probability estimates in theory
      • Probability estimates in practice
      • Probabilistic approaches to relevance feedback
    • An appraisal and some extensions
      • An appraisal of probabilistic models
      • Tree-structured dependencies between terms
      • Okapi BM25: a non-binary model
      • Bayesian network approaches to IR
  • Language models for information retrieval
    • Language models
      • Finite automata and language models
      • Types of language models
      • Multinomial distributions over words
    • The query likelihood model
      • Using query likelihood language models in IR
      • Estimating the query generation probability
      • Ponte and Croft’s Experiments
    • Language modeling versus other approaches in IR
    • Extended language modeling approaches
  • Text classification and Naive Bayes
    • The text classification problem
    • Naive Bayes text classification
      • Relation to multinomial unigram language model
    • The Bernoulli model
    • Properties of Naive Bayes
      • A variant of the multinomial model
    • Feature selection
      • Mutual information
      • χ2 Feature selection
      • Frequency-based feature selection
      • Feature selection for multiple classifiers
      • Comparison of feature selection methods
    • Evaluation of text classification
  • Vector space classification
    • Document representations and measures of relatedness in vector spaces
    • Rocchio classification
    • k nearest neighbor
      • Time complexity and optimality of kNN
    • Linear versus nonlinear classifiers
    • Classification with more than two classes
    • The bias-variance tradeoff
  • Support vector machines and machine learning on documents
    • Support vector machines: The linearly separable case
    • Extensions to the SVM model
      • Soft margin classification
      • Multiclass SVMs
      • Nonlinear SVMs
      • Experimental results
    • Issues in the classification of text documents
      • Choosing what kind of classifier to use
      • Improving classifier performance
    • Machine learning methods in ad hoc information retrieval
      • A simple example of machine-learned scoring
      • Result ranking by machine learning
  • Flat clustering
    • Clustering in information retrieval
    • Problem statement
      • Cardinality – the number of clusters
    • Evaluation of clustering
    • K-means
      • Cluster cardinality in K-means
    • Model-based clustering
  • Hierarchical clustering
    • Hierarchical agglomerative clustering
    • Single-link and complete-link clustering
      • Time complexity of HAC
    • Group-average agglomerative clustering
    • Centroid clustering
    • Optimality of HAC
    • Divisive clustering
    • Cluster labeling
    • Implementation notes
  • Matrix decompositions and latent semantic indexing
    • Linear algebra review
      • Matrix decompositions
    • Term-document matrices and singular value decompositions
    • Low-rank approximations
    • Latent semantic indexing
  • Web search basics
    • Background and history
    • Web characteristics
      • The web graph
      • Spam
    • Advertising as the economic model
    • The search user experience
      • User query needs
    • Index size and estimation
    • Near-duplicates and shingling
  • Web crawling and indexes
    • Crawling
      • Crawler architecture
      • DNS resolution
      • The URL frontier
    • Distributing indexes
    • Connectivity servers
  • Link analysis
    • The Web as a graph
      • Anchor text and the web graph
    • PageRank
      • Markov chains
      • The PageRank computation
      • Topic-specific PageRank
    • Hubs and Authorities
      • Choosing the subset of the Web

Boolean retrieval

  • The term vocabulary and postings lists
    • Document delineation and character sequence decoding
    • Determining the vocabulary of terms
      • Tokenization
      • Dropping common terms: stop words
      • Normalization (equivalence classing of terms)
      • Stemming and lemmatization
    • Faster postings list intersection via skip pointers
    • Positional postings and phrase queries
      • Biword indexes
      • Positional indexes
      • Combination schemes

Dictionaries and tolerant retrieval

  • Search structures for dictionaries
  • Wildcard queries
    • General wildcard queries
    • k-gram indexes for wildcard queries
  • Spelling correction
    • Implementing spelling correction
    • Forms of spelling correction
    • Edit distance
    • k-gram indexes for spelling correction
    • Context sensitive spelling correction
  • Phonetic correction

Index construction

  • Hardware basics
  • Blocked sort-based indexing
  • Single-pass in-memory indexing
  • Distributed indexing
  • Dynamic indexing
  • Other types of indexes

Index compression

  • Statistical properties of terms in information retrieval
    • Heaps’ law: Estimating the number of terms
    • Zipf’s law: Modeling the distribution of terms
  • Dictionary compression
    • Dictionary as a string
    • Blocked storage
  • Postings file compression
    • Variable byte codes
    • γ codes

Scoring, term weighting and the vector space model

  • Parametric and zone indexes
    • Weighted zone scoring
    • Learning weights
    • The optimal weight g
  • Term frequency and weighting
    • nverse document frequency
    • f-idf weighting
  • The vector space model for scoring
    • Dot products
    • Queries as vectors
    • Computing vector scores
  • Variant tf-idf functions
    • Sublinear tf scaling
    • Maximum tf normalization
    • Document and query weighting schemes
    • Pivoted normalized document length

Computing scores in a complete search system

  • Efficient scoring and ranking
    • Inexact top K document retrieval
    • Index elimination
    • Champion lists
    • Static quality scores and ordering
    • Impact ordering
    • Cluster pruning
  • Components of an information retrieval system
    • Tiered indexes
    • Query-term proximity
    • Designing parsing and scoring functions
    • Putting it all together
  • Vector space scoring and query operator interaction

Evaluation in information retrieval

  • Information retrieval system evaluation
  • Standard test collections
  • Evaluation of unranked retrieval sets
  • Evaluation of ranked retrieval results
  • Assessing relevance
    • Critiques and justifications of the concept of relevance
  • A broader perspective: System quality and user utility
    • System issues
    • User utility
    • Refining a deployed system
  • Results snippets

Relevance feedback and query expansion

  • Relevance feedback and pseudo relevance feedback
    • The Rocchio algorithm for relevance feedback
    • Probabilistic relevance feedback
    • When does relevance feedback work?
    • Relevance feedback on the web
    • Evaluation of relevance feedback strategies
    • Pseudo relevance feedback
    • Indirect relevance feedback
  • Global methods for query reformulation
    • Vocabulary tools for query reformulation
    • Query expansion
    • Automatic thesaurus generation

XML retrieval

  • Basic XML concepts
  • Challenges in XML retrieval
  • A vector space model for XML retrieval
  • Evaluation of XML retrieval
  • Text-centric vs. data-centric XML retrieval

Probabilistic information retrieval

  • Review of basic probability theory
  • The Probability Ranking Principle
    • The 1/0 loss case
    • The PRP with retrieval costs
  • The Binary Independence Model
    • Deriving a ranking function for query terms
    • Probability estimates in theory
    • Probability estimates in practice
    • Probabilistic approaches to relevance feedback
  • An appraisal and some extensions
    • An appraisal of probabilistic models
    • Tree-structured dependencies between terms
    • Okapi BM25: a non-binary model
    • Bayesian network approaches to IR

Language models for information retrieval

  • Language models
    • Finite automata and language models
    • Types of language models
    • Multinomial distributions over words
  • The query likelihood model
    • Using query likelihood language models in IR
    • Estimating the query generation probability
    • Ponte and Croft’s Experiments
  • Language modeling versus other approaches in IR
  • Extended language modeling approaches

Text classification and Naive Bayes

  • The text classification problem
  • Naive Bayes text classification
    • Relation to multinomial unigram language model
  • The Bernoulli model
  • Properties of Naive Bayes
    • A variant of the multinomial model
  • Feature selection
    • Mutual information
    • χ2 Feature selection
    • Frequency-based feature selection
    • Feature selection for multiple classifiers
    • Comparison of feature selection methods
  • Evaluation of text classification

Vector space classification

  • Document representations and measures of relatedness in vector spaces
  • Rocchio classification
  • k nearest neighbor
    • Time complexity and optimality of kNN
  • Linear versus nonlinear classifiers
  • Classification with more than two classes
  • The bias-variance tradeoff

Support vector machines and machine learning on documents

  • Support vector machines: The linearly separable case
  • Extensions to the SVM model
    • Soft margin classification
    • Multiclass SVMs
    • Nonlinear SVMs
    • Experimental results
  • Issues in the classification of text documents
    • Choosing what kind of classifier to use
    • Improving classifier performance
  • Machine learning methods in ad hoc information retrieval
    • A simple example of machine-learned scoring
    • Result ranking by machine learning

Flat clustering

  • Clustering in information retrieval
  • Problem statement
    • Cardinality – the number of clusters
  • Evaluation of clustering
  • K-means
    • Cluster cardinality in K-means
  • Model-based clustering

Hierarchical clustering

  • Hierarchical agglomerative clustering
  • Single-link and complete-link clustering
    • Time complexity of HAC
  • Group-average agglomerative clustering
  • Centroid clustering
  • Optimality of HAC
  • Divisive clustering
  • Cluster labeling
  • Implementation notes

Matrix decompositions and latent semantic indexing

  • Linear algebra review
    • Matrix decompositions
  • Term-document matrices and singular value decompositions
  • Low-rank approximations
  • Latent semantic indexing

Web search basics

  • Background and history
  • Web characteristics
    • The web graph
    • Spam
  • Advertising as the economic model
  • The search user experience
    • User query needs
  • Index size and estimation
  • Near-duplicates and shingling

Web crawling and indexes

  • Crawling
    • Crawler architecture
    • DNS resolution
    • The URL frontier
  • Distributing indexes
  • Connectivity servers

Link analysis

  • The Web as a graph
    • Anchor text and the web graph
  • PageRank
    • Markov chains
    • The PageRank computation
    • Topic-specific PageRank
  • Hubs and Authorities
    • Choosing the subset of the Web



Write Your Comments or Suggestion...