PAOLO PULCINI

Neural IR: a personal journey between the latent and the observable space

Get the slides here: paoloearth.github.io/Neural_IR_slides/

A bit of context


IR is a big field.

Neural IR is a new and interesting sub-field of it.

Books could be written on the topic but time and space are limited so ...

I hope you will enjoy my selection

Leonardo da Vinci

Goals

Or things that you should know by the end of the presentation


  • General Overview

    About the applications of NN to IR tasks

  • Able to Experiment

    And implement a basic neural information retrieval system

  • Know where to look

    When developing your own neural IR system

Neural IR: What / Why / How


  • What is it?

    Neural IR is the application of shallow or deep neural networks to IR tasks

  • Why could neural IR be a good idea?

    Since 2010, the application of NNs to computer vision, speech recognition, and other real-world applications has led to several breakthroughs. IR, as a comparatively new application area for NNs, could benefit as well.

  • How are NNs being applied to IR tasks?

    The characteristics of the application play the main role in defining the problem. Different architectures (and datasets) solve different problems.

Taken from "An Introduction to Neural Information Retrieval"

Where are NNs used?

Categorizations:

  1. NN influences the representation of the query
  2. NN influences the representation of the documents
  3. NN influences the matching/relevance estimation
  4. NN influences any combination of the above

* * *

Representation: be wise enough to choose the one that best suits your problem (I)


Terms as vectors

Vector representations are by far the most common.
Two main categories:

  1. Local representation (aka one-hot)
  2. Distributed representation


Vectors allow for arithmetic operations

Similarity

Different representation schemes define distinct notions of similarity between the terms in the corresponding vector space. This leads to different levels of generalization. It is important to learn a term representation that is suitable for each specific task.


Local representations


Leonardo da Vinci
  1. Each term is a unique entity
  2. Terms outside of the fixed vocabulary have no representation

Distributed representations


Leonardo da Vinci
  1. Each term is represented by a (sparse or dense) vector of hand-crafted features or by a latent representation
  2. This feature extraction procedure should allow the definition of "similarity" based on such properties (see the sketch below).
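
A minimal sketch in Python contrasting the two schemes; the toy vocabulary and the dense vectors are made up purely for illustration. With one-hot vectors every pair of distinct terms has zero similarity, while dense vectors can encode graded similarity.

```python
import numpy as np

# Toy vocabulary; terms outside it get no local (one-hot) representation.
vocab = ["banana", "mango", "dog"]

def one_hot(term):
    """Local representation: a unique axis per term."""
    v = np.zeros(len(vocab))
    v[vocab.index(term)] = 1.0
    return v

# Hypothetical dense embeddings (values invented for illustration only).
embedding = {
    "banana": np.array([0.9, 0.1, 0.0]),
    "mango":  np.array([0.8, 0.2, 0.1]),
    "dog":    np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(one_hot("banana"), one_hot("mango")))      # 0.0   -> distinct terms are always unrelated
print(cosine(embedding["banana"], embedding["mango"]))  # ~0.98 -> the two fruits end up close
print(cosine(embedding["banana"], embedding["dog"]))    # ~0.01 -> unrelated terms end up far apart
```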

Representation: be wise enough to choose the one that best suits your problem (II)


Distributional hypothesis

"A word is characterized by the company it keeps "

Firth (1957)

Observed

  • Representations that are measurable (explicitly) from the data, categorized on the basis of:
    • Distributional features (e.g., in documents, neighbouring terms with or without distances)
    • Weighting schemes applied over the raw counts (e.g., TF-IDF)
  • They can capture interesting relationships, but the resulting representations are highly sparse and high-dimensional.

Embeddings

  • Lower-dimensional representations that are learnt from the data and assimilate the properties of the terms and the inter-term relationships observable in the original feature space.
  • NB: with both representations it is possible to use cosine similarity as a metric.

Example sentence: "The quick brown fox jumps over the lazy dog"

Observed feature spaces:

  • In-document features
  • Character-trigraph features
  • Neighbouring-term features
  • Neighbouring-term with distance features
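
As a small sketch (the window size and the boundary markers are my own choices), this is how character-trigraph and neighbouring-term features, with or without distances, could be extracted from the example sentence:

```python
from collections import Counter

sentence = "the quick brown fox jumps over the lazy dog".split()

def char_trigraphs(term):
    """Character-trigraph features of a term, with '#' as boundary marker."""
    padded = f"#{term}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def neighbours(terms, target, window=2, with_distance=False):
    """Neighbouring-term features for `target`, optionally annotated with the distance."""
    feats = Counter()
    for i, t in enumerate(terms):
        if t != target:
            continue
        for j in range(max(0, i - window), min(len(terms), i + window + 1)):
            if j == i:
                continue
            feats[(terms[j], j - i) if with_distance else terms[j]] += 1
    return feats

print(char_trigraphs("fox"))                            # ['#fo', 'fox', 'ox#']
print(neighbours(sentence, "fox"))                      # Counter({'quick': 1, 'brown': 1, 'jumps': 1, 'over': 1})
print(neighbours(sentence, "fox", with_distance=True))  # Counter({('quick', -2): 1, ('brown', -1): 1, ...})
```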

Think sparse, but act dense

Why?

  • Enable inexact matching in the embedding space.
  • Computational complexity.
  • "Models that learn lower dimensional representations performs better than explicit counting-based models on different tasks—possibly due to better generalization across terms". (Levy et al., 2015)

Latent Semantic Analysis

  • Performs singular value decomposition on a term-document matrix X to obtain its low-rank approximation.
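
A minimal sketch of LSA using NumPy's SVD on a tiny, made-up term-document count matrix; truncating to rank \( k \) gives both the low-rank approximation of X and \( k \)-dimensional latent term/document vectors.

```python
import numpy as np

# Toy term-document count matrix X (terms x documents); values are made up.
#              d1 d2 d3 d4
X = np.array([[2, 1, 0, 0],   # "banana"
              [1, 2, 0, 0],   # "mango"
              [0, 0, 3, 1],   # "dog"
              [0, 0, 1, 2]],  # "leash"
             dtype=float)

k = 2                                             # number of latent dimensions
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # rank-k approximation of X
term_embeddings = U[:, :k] * s[:k]                # k-dimensional latent term vectors
doc_embeddings = (np.diag(s[:k]) @ Vt[:k, :]).T   # k-dimensional latent document vectors

print(np.round(X_k, 2))        # close to X, but smoothed
print(term_embeddings.shape)   # (4, 2)
```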

Non-neural learning of embeddings

Term embeddings for IR

Unsupervised term embeddings can be utilized in 2 main ways:

Query-document matching directly in the latent space

  • Embedding-based models often perform poorly when the retrieval is performed over the full document collection. For example, the model may know that a passage is about fruit, but fail to realize that it is specifically about bananas.

Telescoping

  • IDEA: Chain different IR models where each successive model re-ranks a smaller number of candidate documents
  • PRACTICE: Use the embedding based model to re-rank only a subset of the documents retrieved by a different IR model.
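
A sketch of the telescoping idea, assuming that a cheap first-stage scorer (a hypothetical bm25_score) and a more expensive embedding-based scorer (embed_score) are provided by the surrounding system:

```python
def telescope(query, collection, bm25_score, embed_score, k_first=1000, k_final=10):
    """Re-rank only the top candidates of a cheaper first-stage ranker.

    bm25_score(query, doc) and embed_score(query, doc) are assumed to exist
    elsewhere; higher scores mean more relevant.
    """
    # Stage 1: cheap lexical retrieval over the full collection.
    candidates = sorted(collection, key=lambda d: bm25_score(query, d), reverse=True)[:k_first]
    # Stage 2: expensive embedding-based re-ranking over the shortlist only.
    return sorted(candidates, key=lambda d: embed_score(query, d), reverse=True)[:k_final]
```

The expensive model only ever sees k_first documents, so its cost no longer grows with the size of the collection.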

Query expansion

  • Use embeddings to find good expansion candidates from a global vocabulary, and then retrieve documents using the expanded query (see the sketch below).
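
A sketch of embedding-based query expansion, assuming a dictionary mapping terms to dense vectors (e.g., word2vec-style embeddings) is available; the number of neighbours per term is an arbitrary choice:

```python
import numpy as np

def expand_query(query_terms, embeddings, n_neighbours=2):
    """Add, for each query term, its closest terms in the embedding space."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    expanded = list(query_terms)
    for term in query_terms:
        if term not in embeddings:
            continue                      # out-of-vocabulary terms are kept as-is
        scored = [(cosine(embeddings[term], v), w)
                  for w, v in embeddings.items() if w != term]
        expanded += [w for _, w in sorted(scored, reverse=True)[:n_neighbours]]
    return expanded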

Learning to rank: a supervised tale


LTR for IR uses relevance information, such as labels and click data, as training data to improve the quality of the ranking.
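
As a concrete (and heavily simplified) illustration, here is a pairwise, RankNet-style logistic update for a linear scoring model; the feature vectors and the "clicked vs. skipped" pairs below are synthetic stand-ins for real relevance labels or click data:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(5)                       # linear scoring model: score(x) = w . x

def pairwise_update(w, x_pos, x_neg, lr=0.1):
    """RankNet-style update: push score(x_pos) above score(x_neg)."""
    diff = w @ (x_pos - x_neg)
    grad = -(1 - 1 / (1 + np.exp(-diff))) * (x_pos - x_neg)   # gradient of log(1 + exp(-diff))
    return w - lr * grad

# Made-up training pairs: (features of a clicked doc, features of a skipped doc).
pairs = [(rng.normal(1, 1, 5), rng.normal(0, 1, 5)) for _ in range(200)]
for x_pos, x_neg in pairs:
    w = pairwise_update(w, x_pos, x_neg)

x_rel, x_irr = rng.normal(1, 1, 5), rng.normal(0, 1, 5)
print(w @ x_rel > w @ x_irr)          # usually True after training
```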

About the MODEL...

Autoencoders

An unsupervised learning model based on the information bottleneck method: training consists of feeding in high-dimensional input vectors and trying to reconstruct the same representation at the output layer.

In the process, the parameters that encode & decode the input are adjusted to minimize the loss (usually the squared error) between the reconstructed output and the input.
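
A minimal PyTorch sketch of such an autoencoder; the layer sizes and the 2000-dimensional input are illustrative, and the training data here is random:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Compress a high-dimensional input through a low-dimensional bottleneck."""
    def __init__(self, n_in=2000, n_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(),
                                     nn.Linear(256, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 256), nn.ReLU(),
                                     nn.Linear(256, n_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 2000)              # a fake batch of word-count-like vectors
for _ in range(100):                  # minimize the squared reconstruction loss
    opt.zero_grad()
    loss = ((model(x) - x) ** 2).mean()
    loss.backward()
    opt.step()
```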

Variational autoencoders

The encoder generates two separate vectors: means \( \vec{\mu} \) and deviations \( \vec{\sigma} \) (so, for a \( k \)-dimensional latent space, the encoder outputs \( 2k \) values).

The latent representation is generated by sampling a Gaussian along each of the \( k \) latent dimensions.

By sampling the latent representation, we expose the decoder to a certain degree of local variation in its input, which should force the model to learn a smoother, continuous latent space.

An important application of VAEs is the synthesis of new items or text not observed in the training collection.
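
A sketch of the encoder side of a VAE in PyTorch, showing the two output heads ( \( \vec{\mu} \) and \( \vec{\sigma} \), kept in log space for numerical stability), the sampling step, and the KL term that regularizes the latent space; all sizes are illustrative:

```python
import torch
import torch.nn as nn

class VAEncoder(nn.Module):
    """Encoder head of a VAE: outputs means and (log-)deviations for k latent dims."""
    def __init__(self, n_in=2000, k=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(n_in, 256), nn.ReLU())
        self.mu = nn.Linear(256, k)           # \vec{mu}
        self.log_sigma = nn.Linear(256, k)    # log of \vec{sigma}

    def forward(self, x):
        h = self.hidden(x)
        mu, log_sigma = self.mu(h), self.log_sigma(h)
        # Reparameterization trick: sample a Gaussian along each latent dimension.
        z = mu + torch.exp(log_sigma) * torch.randn_like(mu)
        return z, mu, log_sigma

encoder = VAEncoder()
x = torch.rand(8, 2000)
z, mu, log_sigma = encoder(x)                 # z feeds a decoder just like in a plain AE
# KL term that pushes the latent distribution toward a standard Gaussian:
kl = 0.5 * torch.sum(torch.exp(2 * log_sigma) + mu ** 2 - 1 - 2 * log_sigma, dim=1).mean()
```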

Siamese Network

The Siamese network architecture consists of two models \( (m_1, m_2) \) that project two inputs \( (i_1, i_2) \) into the same latent space, obtaining \( (v_1, v_2) \).

The distance (usually cosine similarity) is then computed, and the parameters, which are shared, are optimized such that \( (v_1, v_2) \) are closer when expected to be close & further apart otherwise.

For example, the features representing all the documents about “banana bread” should be very close to each other, but far away from feature clusters of all other documents.

The LOSS is usually the triplet loss, where a baseline vector \( \vec q \) (query) is compared against a positive (relevant) vector \( \vec p \) (e.g., a document clicked by the user) and a negative vector \( \vec n \) (an irrelevant document sampled with uniform probability from the full collection).
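
A sketch of the triplet objective with cosine similarity; the shared encoder here is a single toy linear layer and the margin value is an arbitrary choice:

```python
import torch
import torch.nn.functional as F

def triplet_loss(q, p, n, margin=0.5):
    """Push the query closer (in cosine similarity) to the positive than to the negative.

    q, p, n: batches of vectors produced by the *shared* encoder
    (the query, a relevant/clicked document, and a sampled negative document).
    """
    sim_pos = F.cosine_similarity(q, p, dim=1)
    sim_neg = F.cosine_similarity(q, n, dim=1)
    return torch.clamp(margin - sim_pos + sim_neg, min=0).mean()

encoder = torch.nn.Linear(2000, 64)           # toy shared model: m1 = m2
q_in, p_in, n_in = (torch.rand(16, 2000) for _ in range(3))
loss = triplet_loss(encoder(q_in), encoder(p_in), encoder(n_in))
loss.backward()                               # gradients flow into the shared parameters
```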

A "deep" neural model for ad-hoc retrieval

MODEL: a deep autoencoder trained in an unsupervised setting on an unlabelled document collection.

PREPROCESSING: removing common stopwords, stemming, and then only considering the 2000 most frequent words in the corpus (training set).

IMPLEMENTATION: maps semantically similar documents to nearby addresses (in memory), allowing for fast retrieval.

INPUT: the word-count vector (over the 2000 words) of a document.

INDEXING: a document is mapped to a word-count vector, and this vector is passed through the autoencoder and encoded into a 32-bit address.

OUTPUT: a learned binary code for the document, which represents its matching address (in memory).

RETRIEVAL: makes use of the TELESCOPING technique: the hashing is used to preselect the documents stored at the query address or at close addresses (up to 4 bits away in Hamming distance). The subset is then re-ranked via TF-IDF (section 4.3 of the paper).

PROS: the cost of semantic hashing is independent of the size of the document collection and linear in the size of the shortlist of similar documents. LSA search time, for example, depends linearly on the size of the corpus.

RESULTS: experiments showed that using semantic hashing as a filter for TF-IDF leads to higher precision and recall than TF-IDF applied to the whole document collection, and is much faster.
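
A sketch of the retrieval step under the assumptions above: an index mapping 32-bit codes to documents, a hypothetical encode_to_bits (the trained autoencoder plus binarization) and a hypothetical tfidf_score. For simplicity this sketch scans all occupied addresses; a real implementation would instead reach nearby addresses directly by flipping bits of the query code.

```python
def hamming(a, b):
    """Hamming distance between two 32-bit integer codes."""
    return bin(a ^ b).count("1")

def semantic_hash_retrieve(query, index, encode_to_bits, tfidf_score, max_dist=4, k=10):
    """Telescoping retrieval in the spirit of semantic hashing.

    index: dict mapping 32-bit codes to lists of documents.
    encode_to_bits and tfidf_score are assumed to exist elsewhere.
    """
    q_code = encode_to_bits(query)
    # Stage 1: preselect documents stored at the query address or nearby addresses.
    shortlist = [doc for code, docs in index.items()
                 if hamming(code, q_code) <= max_dist for doc in docs]
    # Stage 2: re-rank the (small) shortlist with TF-IDF.
    return sorted(shortlist, key=lambda d: tfidf_score(query, d), reverse=True)[:k]
```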

Interaction based network

IDEA:

Compare different parts of the query with different parts of the document, then aggregate these partial pieces of evidence of relevance.

This operation can be very useful when dealing with long documents, which may contain a mixture of many topics.

Implementation:

A sliding window is moved over both the query and the document text and each instance of the window over the query is compared against each instance of the window over the document text, generating an "interaction matrix"

A neural model (typically convolutional) operates over the generated interaction matrix and aggregates the evidence across all the pairs of windows compared, to find patterns of matches that suggest relevance of the document to the query
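
A sketch of how such an interaction matrix could be built, assuming an embed function that maps a window of terms to a dense vector (e.g., an average or sum of term embeddings); the window sizes are arbitrary, and a convolutional model would then consume the resulting matrix:

```python
import numpy as np

def sliding_windows(terms, size=3, step=1):
    """Sliding window over a list of terms."""
    return [terms[i:i + size] for i in range(0, max(1, len(terms) - size + 1), step)]

def interaction_matrix(query_windows, doc_windows, embed):
    """Cosine similarity between every query window and every document window."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.array([[cosine(embed(qw), embed(dw)) for dw in doc_windows]
                     for qw in query_windows])

# Tiny usage example with a stand-in window encoder (sum of per-term random vectors).
terms = "the quick brown fox jumps over the lazy dog".split()
vecs = {t: np.random.default_rng(abs(hash(t)) % 2**32).random(8) for t in terms}
embed = lambda window: sum(vecs[t] for t in window)

q_win = sliding_windows("quick fox".split(), size=2)
d_win = sliding_windows(terms, size=3)
print(interaction_matrix(q_win, d_win, embed).shape)   # (1, 7)
```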

Lexical & semantic matching

Context:

Most of the applications of NNs to IR are about finding good embeddings, that is, a good representation of text. These representations present both advantages & disadvantages.

Embedding-based models often perform poorly on retrieval tasks involving specific terms such as proper names of companies, places, etc. (e.g., Lee's sausage company), since it is unlikely that the model has a good representation for such terms.

On the other hand, lexical matching would not work when the system is asked something "implicit" like: "On which channel is Ajax playing today?" The target document will probably contain proper names of channels, like Rai 1 or Canale 5, but not the term "channel" per se.

A duet architecture: a good neural IR model should incorporate both lexical and semantic matching signals.
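
The real duet model learns a lexical and a semantic sub-network jointly; as a rough sketch of the underlying idea, the two signals can simply be assumed to be given and mixed linearly (the mixing weight alpha is arbitrary):

```python
def duet_score(query, doc, lexical_score, semantic_score, alpha=0.5):
    """Combine a lexical matching signal (exact term overlap, e.g. BM25-like)
    with a semantic one (embedding similarity). Both scoring functions are
    assumed to be provided by the surrounding system."""
    return alpha * lexical_score(query, doc) + (1 - alpha) * semantic_score(query, doc)
```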

Conclusions

  • Desiderata of a model

    Of short and long text

    Retrieval of long text: a model must deal with variable length documents where the relevant sections (to the query) may be surrounded by (a lot of) irrelevant text.

    Retrieval of short text: a model must deal with query-document vocabulary mismatch problem, by learning how patterns of query terms and (different) document terms can indicate relevance.

    In either case, a model should also consider lexical matches when the query contains rare terms (not seen during training), to avoid retrieving semantically related but irrelevant results.

    Context

    Ideal IR models should be able to discriminate or rank between documents inferring the meaning of a query from context.

    For example, if one searches for "soccer world cup winner", it is highly probable that what they want is the winner of the most recent edition. And that should be understood by the model via the context or the user’s short- or long-term history.

    The need of labelled data

    IR is “a little behind” w.r.t. CV & NLP mostly because it suffers heavily (and for good reasons, such as privacy) from the lack of annotated (labelled) document collections.

  • Present problems and future goals


    "Should the ideal IR model behave like a library that knows about everything in the Universe, or like a librarian who can effectively retrieve without memorizing the corpus"

    Mitra (2018)

    A brute force approach

😎

Thank you!