PAOLO PULCINI

Neural IR: a personal journey between the latent and the observable space

Get the slides here: paoloearth.github.io/Neural_IR_slides/

A bit of context


IR is a big field.

Neural IR is a new and interesting sub-field of it.

Books could be written on the topic but time and space are limited so ...

I hope you will enjoy my selection

Leonardo da Vinci

Goals

Or things that you should know by the end of the presentation


  • General Overview

    About the applications of NN to IR tasks

  • Able to Experiment

    And implement a basic neural information retrieval system

  • Know where to look

    When developing your own neural IR system

Neural IR: What / Why / How


  • What is it?

    Neural IR is the application of shallow or deep neural networks to IR tasks

  • Why could neural IR be a good idea?

    Since 2010, the application of NNs to computer vision, speech recognition, and other real-world applications has led to several breakthroughs. IR, as a comparatively new application area for NNs, could benefit as well.

  • How are NNs being applied to IR tasks?

    The characteristics of the application play the main role in defining the problem. Different architectures (and datasets) solve different problems.

Taken from "An Introduction to Neural Information Retrieval"

Where are NNs used?

Categorizations:

  1. NN influences the representation of the query
  2. NN influences the representation of the documents
  3. NN influences the matching/relevance estimation
  4. NN influences any combination of the above

* * *

Representation: be wise enough to choose the one that best suits your problem (I)


Terms as vectors

Vector representations are by far the most common.
Two main categories:

  1. Local representation (aka one-hot)
  2. Distributed representation


Vectors allow for arithmetic operations

Similarity

Different representation schemes define distinct notions of similarity between the terms in the corresponding vector space. This leads to different levels of generalization. It is important to learn a term representation that is suitable for each specific task.


Local representations


Leonardo da Vinci
  1. Each term is a unique entity
  2. Terms outside of the fixed vocabulary have no representation

Distributed representations


Leonardo da Vinci
  1. Each term is represented by a (sparse or dense) vector of hand-crafted features or by a latent representation
  2. This feature extraction procedure should allow the definition of "similarity" based on such properties (see the sketch below).
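
A minimal sketch in Python contrasting the two schemes; the toy vocabulary and the dense vectors are made up purely for illustration. With one-hot vectors every pair of distinct terms has zero similarity, while dense vectors can encode graded similarity.

```python
import numpy as np

# Toy vocabulary; terms outside it get no local (one-hot) representation.
vocab = ["banana", "mango", "dog"]

def one_hot(term):
    """Local representation: a unique axis per term."""
    v = np.zeros(len(vocab))
    v[vocab.index(term)] = 1.0
    return v

# Hypothetical dense embeddings (values invented for illustration only).
embedding = {
    "banana": np.array([0.9, 0.1, 0.0]),
    "mango":  np.array([0.8, 0.2, 0.1]),
    "dog":    np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(one_hot("banana"), one_hot("mango")))      # 0.0   -> distinct terms are always unrelated
print(cosine(embedding["banana"], embedding["mango"]))  # ~0.98 -> the two fruits end up close
print(cosine(embedding["banana"], embedding["dog"]))    # ~0.01 -> unrelated terms end up far apart
```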

Representation: be wise enough to choose the one that best suits your problem (II)


Distributional hypothesis

"A word is characterized by the company it keeps "

Firth (1957)

Observed

  • Representations that are measurable (explicitly) from the data, categorized on the basis of:
    • Distributional features (e.g., in documents, neighbouring terms with or without distances)
    • Weighting schemes applied over the raw counts (e.g., TF-IDF)
  • They can capture interesting relationships, but the resulting representations are highly sparse and high-dimensional.

Embeddings

  • Lower-dimensional representations that are learnt from the data and assimilate the properties of the terms and the inter-term relationships observable in the original feature space.
  • NB: with both representations it is possible to use cosine similarity as a metric.

Example sentence: "The quick brown fox jumps over the lazy dog"

Observed feature spaces:

  • In-document features
  • Character-trigraph features
  • Neighbouring-term features
  • Neighbouring-term with distance features
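
As a small sketch (the window size and the boundary markers are my own choices), this is how character-trigraph and neighbouring-term features, with or without distances, could be extracted from the example sentence:

```python
from collections import Counter

sentence = "the quick brown fox jumps over the lazy dog".split()

def char_trigraphs(term):
    """Character-trigraph features of a term, with '#' as boundary marker."""
    padded = f"#{term}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def neighbours(terms, target, window=2, with_distance=False):
    """Neighbouring-term features for `target`, optionally annotated with the distance."""
    feats = Counter()
    for i, t in enumerate(terms):
        if t != target:
            continue
        for j in range(max(0, i - window), min(len(terms), i + window + 1)):
            if j == i:
                continue
            feats[(terms[j], j - i) if with_distance else terms[j]] += 1
    return feats

print(char_trigraphs("fox"))                            # ['#fo', 'fox', 'ox#']
print(neighbours(sentence, "fox"))                      # Counter({'quick': 1, 'brown': 1, 'jumps': 1, 'over': 1})
print(neighbours(sentence, "fox", with_distance=True))  # Counter({('quick', -2): 1, ('brown', -1): 1, ...})
```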

Think sparse, but act dense

Why?

  • Enable inexact matching in the embedding space.
  • Computational complexity.
  • "Models that learn lower dimensional representations performs better than explicit counting-based models on different tasks—possibly due to better generalization across terms". (Levy et al., 2015)

Latent Semantic Analysis

  • Performs singular value decomposition on a term-document matrix X to obtain its low-rank approximation.
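
A minimal sketch of LSA using NumPy's SVD on a tiny, made-up term-document count matrix; truncating to rank \( k \) gives both the low-rank approximation of X and \( k \)-dimensional latent term/document vectors.

```python
import numpy as np

# Toy term-document count matrix X (terms x documents); values are made up.
#              d1 d2 d3 d4
X = np.array([[2, 1, 0, 0],   # "banana"
              [1, 2, 0, 0],   # "mango"
              [0, 0, 3, 1],   # "dog"
              [0, 0, 1, 2]],  # "leash"
             dtype=float)

k = 2                                             # number of latent dimensions
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # rank-k approximation of X
term_embeddings = U[:, :k] * s[:k]                # k-dimensional latent term vectors
doc_embeddings = (np.diag(s[:k]) @ Vt[:k, :]).T   # k-dimensional latent document vectors

print(np.round(X_k, 2))        # close to X, but smoothed
print(term_embeddings.shape)   # (4, 2)
```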

Non-neural learning of embeddings

Term embeddings for IR

Unsupervised term embeddings can be utilized in 2 main ways:

Query-document matching directly in the latent space

  • Embedding-based models often perform poorly when the retrieval is performed over the full document collection. For example, the model may know that a passage is about fruit, but fail to realize that it is specifically about bananas.

Telescoping

  • IDEA: Chain different IR models where each successive model re-ranks a smaller number of candidate documents
  • PRACTICE: Use the embedding based model to re-rank only a subset of the documents retrieved by a different IR model.
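
A sketch of the telescoping idea, assuming that a cheap first-stage scorer (a hypothetical bm25_score) and a more expensive embedding-based scorer (embed_score) are provided by the surrounding system:

```python
def telescope(query, collection, bm25_score, embed_score, k_first=1000, k_final=10):
    """Re-rank only the top candidates of a cheaper first-stage ranker.

    bm25_score(query, doc) and embed_score(query, doc) are assumed to exist
    elsewhere; higher scores mean more relevant.
    """
    # Stage 1: cheap lexical retrieval over the full collection.
    candidates = sorted(collection, key=lambda d: bm25_score(query, d), reverse=True)[:k_first]
    # Stage 2: expensive embedding-based re-ranking over the shortlist only.
    return sorted(candidates, key=lambda d: embed_score(query, d), reverse=True)[:k_final]
```

The expensive model only ever sees k_first documents, so its cost no longer grows with the size of the collection.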

Query expansion

  • Use embeddings to find good expansion candidates from a global vocabulary, and then retrieve documents using the expanded query (see the sketch below).
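
A sketch of embedding-based query expansion, assuming a dictionary mapping terms to dense vectors (e.g., word2vec-style embeddings) is available; the number of neighbours per term is an arbitrary choice:

```python
import numpy as np

def expand_query(query_terms, embeddings, n_neighbours=2):
    """Add, for each query term, its closest terms in the embedding space."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    expanded = list(query_terms)
    for term in query_terms:
        if term not in embeddings:
            continue                      # out-of-vocabulary terms are kept as-is
        scored = [(cosine(embeddings[term], v), w)
                  for w, v in embeddings.items() if w != term]
        expanded += [w for _, w in sorted(scored, reverse=True)[:n_neighbours]]
    return expanded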

Learning to rank: a supervised tale


LTR for IR uses relevance information, such as labels and click data, as training data to improve the quality of the ranking.
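
As a concrete (and heavily simplified) illustration, here is a pairwise, RankNet-style logistic update for a linear scoring model; the feature vectors and the "clicked vs. skipped" pairs below are synthetic stand-ins for real relevance labels or click data:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(5)                       # linear scoring model: score(x) = w . x

def pairwise_update(w, x_pos, x_neg, lr=0.1):
    """RankNet-style update: push score(x_pos) above score(x_neg)."""
    diff = w @ (x_pos - x_neg)
    grad = -(1 - 1 / (1 + np.exp(-diff))) * (x_pos - x_neg)   # gradient of log(1 + exp(-diff))
    return w - lr * grad

# Made-up training pairs: (features of a clicked doc, features of a skipped doc).
pairs = [(rng.normal(1, 1, 5), rng.normal(0, 1, 5)) for _ in range(200)]
for x_pos, x_neg in pairs:
    w = pairwise_update(w, x_pos, x_neg)

x_rel, x_irr = rng.normal(1, 1, 5), rng.normal(0, 1, 5)
print(w @ x_rel > w @ x_irr)          # usually True after training
```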

About the MODEL...

Autoencoders

An unsupervised learning model based on the information bottleneck method: training consists of feeding in high-dimensional input vectors and trying to reconstruct the same representation at the output layer.

In the process, the parameters that encode & decode the input are adjusted to minimize the loss (usually the squared error) between the reconstructed output and the input.
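
A minimal PyTorch sketch of such an autoencoder; the layer sizes and the 2000-dimensional input are illustrative, and the training data here is random:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Compress a high-dimensional input through a low-dimensional bottleneck."""
    def __init__(self, n_in=2000, n_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(),
                                     nn.Linear(256, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 256), nn.ReLU(),
                                     nn.Linear(256, n_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 2000)              # a fake batch of word-count-like vectors
for _ in range(100):                  # minimize the squared reconstruction loss
    opt.zero_grad()
    loss = ((model(x) - x) ** 2).mean()
    loss.backward()
    opt.step()
```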

Variational autoencoders

The encoder generates two separate vectors: means \( \vec{\mu} \) and deviations \( \vec{\sigma} \) (so, for a \( k \)-dimensional latent space, the encoder outputs \( 2k \) values).

The latent representation is generated by sampling a Gaussian along each of the \( k \) latent dimensions.

By sampling the latent representation, we expose the decoder to a certain degree of local variation in its input, which should force the model to learn a smoother, continuous latent space.

An important application of VAEs is the synthesis of new items or text not observed in the training collection.
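
A sketch of the encoder side of a VAE in PyTorch, showing the two output heads ( \( \vec{\mu} \) and \( \vec{\sigma} \), kept in log space for numerical stability), the sampling step, and the KL term that regularizes the latent space; all sizes are illustrative:

```python
import torch
import torch.nn as nn

class VAEncoder(nn.Module):
    """Encoder head of a VAE: outputs means and (log-)deviations for k latent dims."""
    def __init__(self, n_in=2000, k=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(n_in, 256), nn.ReLU())
        self.mu = nn.Linear(256, k)           # \vec{mu}
        self.log_sigma = nn.Linear(256, k)    # log of \vec{sigma}

    def forward(self, x):
        h = self.hidden(x)
        mu, log_sigma = self.mu(h), self.log_sigma(h)
        # Reparameterization trick: sample a Gaussian along each latent dimension.
        z = mu + torch.exp(log_sigma) * torch.randn_like(mu)
        return z, mu, log_sigma

encoder = VAEncoder()
x = torch.rand(8, 2000)
z, mu, log_sigma = encoder(x)                 # z feeds a decoder just like in a plain AE
# KL term that pushes the latent distribution toward a standard Gaussian:
kl = 0.5 * torch.sum(torch.exp(2 * log_sigma) + mu ** 2 - 1 - 2 * log_sigma, dim=1).mean()
```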

Siamese Network

The Siamese network architecture consists of two models \( (m_1, m_2) \) that project two inputs \( (i_1, i_2) \) into the same latent space, obtaining \( (v_1, v_2) \).

The distance (usually cosine similarity) is then computed, and the parameters, which are shared, are optimized such that \( (v_1, v_2) \) are closer when expected to be close & further apart otherwise.

For example, the features representing all the documents about “banana bread” should be very close to each other, but far away from feature clusters of all other documents.

The LOSS is usually the triplet loss, where a baseline vector \( \vec q \) (query) is compared against a positive (relevant) vector \( \vec p \) (e.g., a document clicked by the user) and a negative vector \( \vec n \) (an irrelevant document sampled with uniform probability from the full collection).
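
A sketch of the triplet objective with cosine similarity; the shared encoder here is a single toy linear layer and the margin value is an arbitrary choice:

```python
import torch
import torch.nn.functional as F

def triplet_loss(q, p, n, margin=0.5):
    """Push the query closer (in cosine similarity) to the positive than to the negative.

    q, p, n: batches of vectors produced by the *shared* encoder
    (the query, a relevant/clicked document, and a sampled negative document).
    """
    sim_pos = F.cosine_similarity(q, p, dim=1)
    sim_neg = F.cosine_similarity(q, n, dim=1)
    return torch.clamp(margin - sim_pos + sim_neg, min=0).mean()

encoder = torch.nn.Linear(2000, 64)           # toy shared model: m1 = m2
q_in, p_in, n_in = (torch.rand(16, 2000) for _ in range(3))
loss = triplet_loss(encoder(q_in), encoder(p_in), encoder(n_in))
loss.backward()                               # gradients flow into the shared parameters
```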

A "deep" neural model for ad-hoc retrieval

MODEL: a deep autoencoder trained in an unsupervised setting on an unlabelled document collection.

PREPROCESSING: removing common stopwords, stemming, and then only considering the 2000 most frequent words in the corpus (training set).

IMPLEMENTATION: maps semantically similar documents to nearby addresses (in memory), allowing for fast retrieval.

INPUT: the word-count vector (over the 2000 words) of a document.

INDEXING: a document is mapped to a word-count vector, and this vector is passed through the autoencoder and encoded into a 32-bit address.

OUTPUT: a learned binary code for the document, which represents its matching address (in memory).

RETRIEVAL: makes use of the TELESCOPING technique: the hashing is used to preselect the documents stored at the query address or at close addresses (up to 4 bits away in Hamming distance). The subset is then re-ranked via TF-IDF (section 4.3 of the paper).

PROS: the cost of semantic hashing is independent of the size of the document collection and linear in the size of the shortlist of similar documents. LSA search time, for example, depends linearly on the size of the corpus.

RESULTS: experiments showed that using semantic hashing as a filter for TF-IDF leads to higher precision and recall than TF-IDF applied to the whole document collection, and is much faster.
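
A sketch of the retrieval step under the assumptions above: an index mapping 32-bit codes to documents, a hypothetical encode_to_bits (the trained autoencoder plus binarization) and a hypothetical tfidf_score. For simplicity this sketch scans all occupied addresses; a real implementation would instead reach nearby addresses directly by flipping bits of the query code.

```python
def hamming(a, b):
    """Hamming distance between two 32-bit integer codes."""
    return bin(a ^ b).count("1")

def semantic_hash_retrieve(query, index, encode_to_bits, tfidf_score, max_dist=4, k=10):
    """Telescoping retrieval in the spirit of semantic hashing.

    index: dict mapping 32-bit codes to lists of documents.
    encode_to_bits and tfidf_score are assumed to exist elsewhere.
    """
    q_code = encode_to_bits(query)
    # Stage 1: preselect documents stored at the query address or nearby addresses.
    shortlist = [doc for code, docs in index.items()
                 if hamming(code, q_code) <= max_dist for doc in docs]
    # Stage 2: re-rank the (small) shortlist with TF-IDF.
    return sorted(shortlist, key=lambda d: tfidf_score(query, d), reverse=True)[:k]
```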

Interaction based network

IDEA:

Compare different parts of the query with different parts of the document, then aggregate these partial pieces of evidence of relevance.

This operation can be very useful when dealing with long documents, which may contain a mixture of many topics.

Implementation:

A sliding window is moved over both the query and the document text and each instance of the window over the query is compared against each instance of the window over the document text, generating an "interaction matrix"

A neural model (typically convolutional) operates over the generated interaction matrix and aggregates the evidence across all the pairs of windows compared, to find patterns of matches that suggest relevance of the document to the query
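
A sketch of how such an interaction matrix could be built, assuming an embed function that maps a window of terms to a dense vector (e.g., an average or sum of term embeddings); the window sizes are arbitrary, and a convolutional model would then consume the resulting matrix:

```python
import numpy as np

def sliding_windows(terms, size=3, step=1):
    """Sliding window over a list of terms."""
    return [terms[i:i + size] for i in range(0, max(1, len(terms) - size + 1), step)]

def interaction_matrix(query_windows, doc_windows, embed):
    """Cosine similarity between every query window and every document window."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.array([[cosine(embed(qw), embed(dw)) for dw in doc_windows]
                     for qw in query_windows])

# Tiny usage example with a stand-in window encoder (sum of per-term random vectors).
terms = "the quick brown fox jumps over the lazy dog".split()
vecs = {t: np.random.default_rng(abs(hash(t)) % 2**32).random(8) for t in terms}
embed = lambda window: sum(vecs[t] for t in window)

q_win = sliding_windows("quick fox".split(), size=2)
d_win = sliding_windows(terms, size=3)
print(interaction_matrix(q_win, d_win, embed).shape)   # (1, 7)
```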

Lexical & semantic matching

Context:

Most of the applications of NNs to IR are about finding good embeddings, that is, a good representation of text. These representations present both advantages & disadvantages.

Embedding-based models often perform poorly on retrieval tasks involving specific terms such as proper names of companies, places, etc. (e.g., Lee's sausage company), since it is unlikely that the model has a good representation for such terms.

On the other hand, lexical matching would not work when the system is asked something "implicit" like: "On which channel is Ajax playing today?" The target document will probably contain proper names of channels, like Rai 1 or Canale 5, but not the term "channel" per se.

A duet architecture: a good neural IR model should incorporate both lexical and semantic matching signals.
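
The real duet model learns a lexical and a semantic sub-network jointly; as a rough sketch of the underlying idea, the two signals can simply be assumed to be given and mixed linearly (the mixing weight alpha is arbitrary):

```python
def duet_score(query, doc, lexical_score, semantic_score, alpha=0.5):
    """Combine a lexical matching signal (exact term overlap, e.g. BM25-like)
    with a semantic one (embedding similarity). Both scoring functions are
    assumed to be provided by the surrounding system."""
    return alpha * lexical_score(query, doc) + (1 - alpha) * semantic_score(query, doc)
```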

Conclusions

  • Desiderata of a model

    Of short and long text

    Retrieval of long text: a model must deal with variable length documents where the relevant sections (to the query) may be surrounded by (a lot of) irrelevant text.

    Retrieval of short text: a model must deal with query-document vocabulary mismatch problem, by learning how patterns of query terms and (different) document terms can indicate relevance.

    In either case, a model should also consider lexical matches when the query contains rare terms (not seen during training), to avoid retrieving semantically related but irrelevant results.

    Context

    Ideal IR models should be able to discriminate or rank between documents inferring the meaning of a query from context.

    For example, if one searches for "soccer world cup winner", it is highly probable that what they want is the winner of the most recent edition. And that should be understood by the model via the context or the user’s short- or long-term history.

    The need of labelled data

    IR is “a little behind” w.r.t. CV & NLP mostly because it suffers heavily (and for good reasons, such as privacy) from the lack of annotated (labelled) document collections.

  • Present problems and future goals


    "Should the ideal IR model behave like a library that knows about everything in the Universe, or like a librarian who can effectively retrieve without memorizing the corpus"

    Mitra (2018)

    A brute force approach

😎

Thank you!