lucene-dev mailing list archives

From Andrzej Bialecki <...@getopt.org>
Subject Some interesting IR ideas
Date Tue, 03 Jan 2012 00:53:54 GMT
Hi Lucene geeks,

We have a whole new year in front of us and we don't want to become 
bored, do we... so I thought I'd share some interesting ideas that I 
encountered over the past few months while now and then reading a bunch 
of papers on IR. No code yet, sorry! Just wondering what it would be like 
if Lucene supported this or that functionality. Feel free to say "nuts" 
or "useless" or "brilliant" or anything in between. Or come up with your 
ideas!

Mainly the following concepts are about maintaining additional index 
data for improved performance or functionality. Experimentation in this 
area became practical now in trunk with the completion of the Codec API, 
but there may still be some things missing in the APIs, for example the 
ability to discover, select and process sub-lists of postings, or 
customization of query evaluation algorithms, etc.

Some of these ideas got implemented as part of the original research - 
I'm sorry to say that nearly none of them used Lucene; usually it was 
either Zettair or Terrier. I'd blame the pre-flex APIs for this, so 
hopefully the situation will improve in the coming years.

So, here we go.

1. Block-Max indexes
====================
The idea is presented fully here: 
http://cis.poly.edu/suel/papers/bmw.pdf . Basically, it's about skipping 
parts of posting lists that are unlikely to contribute to the top-N 
documents. Parts of the lists are marked with, well, tombstones that 
carry a value: the maximum score of a term query over a given range of 
doc-ids (under some metric). For some types of queries it's possible to 
predict, based on that maximum term score (or any other useful metric 
for that matter), whether any match in a given portion of the posting 
list could produce a candidate that makes it into the top-N. You can 
read the gory details of query evaluation in the paper. This is part of 
the broader topic of dynamic pruning of query evaluation, and I have a 
dozen or so other references on this.

In Lucene, we could handle such tombstones using a specialized codec. 
However, I think the query evaluation mechanism wouldn't be able to use 
this information to skip certain ranges of docs... or maybe it could be 
implemented as filters initialized from tombstone values?
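
To make the mechanics concrete, here's a minimal, Lucene-independent 
sketch of the skipping step - plain Java with invented names (BlockMeta, 
topNThreshold), not a proposal for the actual codec or scorer API:

import java.util.List;

// Sketch only: each block of a posting list carries the maximum per-term
// score ("tombstone" value) of any document it contains, so whole blocks
// can be skipped once the current top-N threshold exceeds that maximum.
class BlockMeta {
  final int firstDocId, lastDocId; // doc-id range covered by this block
  final float maxTermScore;        // max score of the term within the block

  BlockMeta(int firstDocId, int lastDocId, float maxTermScore) {
    this.firstDocId = firstDocId;
    this.lastDocId = lastDocId;
    this.maxTermScore = maxTermScore;
  }
}

class BlockMaxSkipper {
  // Returns the first block at or after currentBlock whose maximum score
  // could still beat the weakest entry in the top-N heap; blocks below the
  // threshold are skipped without decoding any of their postings.
  static int nextCompetitiveBlock(List<BlockMeta> blocks, int currentBlock,
                                  float topNThreshold) {
    for (int b = currentBlock; b < blocks.size(); b++) {
      if (blocks.get(b).maxTermScore > topNThreshold) {
        return b; // this block may contain a competitive doc: decode it
      }
      // otherwise every doc in block b scores at most maxTermScore,
      // which cannot enter the current top-N, so skip the whole block
    }
    return -1; // no remaining block can contribute to the top-N
  }
}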

2. Time-machine indexes
=======================
This is really a variant of the above, only the tombstones record 
timestamps (and of course the index is allowed to hold duplicates of 
documents).

We can already approximate this by limiting query evaluation to only 
the latest segments (if we can guarantee that segment creation / 
merging follows monotonically increasing timestamps). But using 
tombstones we could merge segments from different periods of time, as 
long as we guarantee that we don't split&shuffle blocks of postings that 
belong to the same timestamp.

Query evaluation that concerns a time range would then be able to skip 
directly to the right tombstones based on timestamps (plus some 
additional filtering if tombstones are too coarse-grained). No idea how 
to implement this with the current API - maybe with filters, as above?

Note that the current flex API always assumes that postings need to be 
fully decoded for evaluation, because the evaluation algorithms are 
codec-independent. Perhaps we could come up with an API that allows us 
to customize the evaluation algorithms based on the codec implementation?
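
For what it's worth, a rough sketch of what such timestamp tombstones 
could look like (again plain Java, invented names, and the per-document 
filtering inside a matching block is left out):

// Sketch only: each block records the min/max timestamp of the documents
// it covers, so a time-range query can skip straight to the first block
// that overlaps the requested range; finer-grained filtering still has to
// happen inside the block if the tombstones are too coarse.
class TimeBlock {
  final long minTimestamp, maxTimestamp;
  final long postingsOffset; // where this block's postings start

  TimeBlock(long minTimestamp, long maxTimestamp, long postingsOffset) {
    this.minTimestamp = minTimestamp;
    this.maxTimestamp = maxTimestamp;
    this.postingsOffset = postingsOffset;
  }

  boolean overlaps(long from, long to) {
    return maxTimestamp >= from && minTimestamp <= to;
  }
}

class TimeMachineSkipper {
  // Index of the first block that can contain docs in [from, to], or -1.
  static int firstOverlappingBlock(TimeBlock[] blocks, long from, long to) {
    for (int i = 0; i < blocks.length; i++) {
      if (blocks[i].overlaps(from, to)) {
        return i; // start decoding at blocks[i].postingsOffset
      }
      // blocks entirely outside the range are skipped without decoding
    }
    return -1;
  }
}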

3. Caching results as an in-memory inverted index
=================================================
I can't find the paper right now ... perhaps it was by Torsten Suel, who 
did a lot of research on the topic of caching. In Solr we use caches to 
hold docsets from past queries, and we can do some limited 
intersections for simple boolean queries. The idea here is really 
simple: since we already pull in results and doc fields (and, from the 
re-written queries, we know which terms contribute to these results, so 
we could provide those too), we could use this information to build a 
memory-constrained inverted index. It would answer not only simple 
boolean queries via intersections of bitsets, but possibly also other 
queries that require full query evaluation - and under some metric we 
could decide whether the results are exact, good enough, or need to be 
evaluated against the full index. We could then periodically prune this 
index based on LFU, LRU or some such strategy.
Hmm, part of this idea is here, I think: 
http://www2008.org/papers/pdf/p387-zhangA.pdf or here: 
http://www2005.org/cdrom/docs/p257.pdf
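
Just to show the shape of it, a toy sketch of such a cache - a tiny 
in-memory inverted index over cached results, answering conjunctions via 
bitset intersection and pruned with a naive LFU policy. Names are made 
up, and this has nothing to do with Solr's existing cache classes:

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch only: a small inverted index over the documents returned by past
// queries. Simple conjunctions are answered by intersecting bitsets; terms
// the cache doesn't know about force a fall-back to the full index.
class ResultCacheIndex {
  private static class Entry {
    final BitSet docs = new BitSet(); // cache-local doc-ids containing the term
    long useCount;                    // for LFU-style pruning
  }

  private final Map<String, Entry> postings = new HashMap<String, Entry>();
  private final int maxTerms;

  ResultCacheIndex(int maxTerms) {
    this.maxTerms = maxTerms;
  }

  // Record that a cached result document matched the given query term.
  void add(String term, int cacheDocId) {
    Entry e = postings.get(term);
    if (e == null) {
      e = new Entry();
      postings.put(term, e);
    }
    e.docs.set(cacheDocId);
  }

  // Conjunction over cached results; returns null on a cache miss.
  BitSet and(String... terms) {
    BitSet result = null;
    for (String term : terms) {
      Entry e = postings.get(term);
      if (e == null) {
        return null; // unknown term: evaluate against the full index instead
      }
      e.useCount++;
      if (result == null) {
        result = (BitSet) e.docs.clone();
      } else {
        result.and(e.docs);
      }
    }
    return result;
  }

  // Naive LFU pruning: drop the least-used terms once over budget.
  void maybePrune() {
    while (postings.size() > maxTerms) {
      String coldest = null;
      long best = Long.MAX_VALUE;
      for (Map.Entry<String, Entry> e : postings.entrySet()) {
        if (e.getValue().useCount < best) {
          best = e.getValue().useCount;
          coldest = e.getKey();
        }
      }
      postings.remove(coldest);
    }
  }
}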

BTW, there are dozens of papers on caching in search engines, for 
example this: 
http://www.hugo-zaragoza.net/academic/pdf/blanco_SIGIR2010b.pdf - here 
the author argues that instead of throwing away all cached lists after 
an index update (which we do in Solr), we should keep those lists that 
are likely to give the same results as before the update.

4. Phrase indexing
==================
Again, I lost the reference to the paper that describes this ... I'll 
find it later. Phrase indexing is of course well known, and has 
well-known benefits and costs (mostly prohibitive except for a very 
limited number of phrases). The idea here is to index phrases in such a 
way that the term dictionary (and postings) consists only of relatively 
long phrases, and postings for all shorter phrases subsumed by the long 
phrases are put in the same posting lists. The dictionary then also 
needs to store pointers from each leading term of the shorter phrases to 
the corresponding longer phrase entry, so that we can find the right 
postings given a shorter phrase. And postings are also augmented with 
bitmasks that indicate which terms of the phrase match in which document 
on the list.

(Hmm, maybe it was this paper? 
http://www.mpi-inf.mpg.de/~bast/papers/autocompletion-sigir.pdf)
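
To illustrate the shape of such a dictionary (all names invented, and 
the positional check for exact phrase order is deliberately left out):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: the dictionary stores posting lists for long phrases; shorter
// sub-phrases are reached via pointers keyed by their leading term, and each
// posting carries a bitmask of which terms of the long phrase occur in that
// document.
class PhrasePosting {
  final int docId;
  final int termMask; // bit i set => i-th term of the long phrase occurs in docId

  PhrasePosting(int docId, int termMask) {
    this.docId = docId;
    this.termMask = termMask;
  }
}

class PhraseDictionary {
  // one posting list per indexed long phrase, e.g. "new york stock exchange"
  final Map<String, List<PhrasePosting>> longPhrases =
      new HashMap<String, List<PhrasePosting>>();
  // leading term of a shorter phrase -> the long phrase that subsumes it,
  // e.g. "stock" -> "new york stock exchange" for the sub-phrase "stock exchange"
  final Map<String, String> subPhrasePointers = new HashMap<String, String>();

  // Candidate postings for a (possibly shorter) phrase, via its leading term.
  List<PhrasePosting> candidates(String leadingTerm) {
    String longPhrase = subPhrasePointers.get(leadingTerm);
    if (longPhrase == null) {
      return Collections.<PhrasePosting>emptyList();
    }
    return longPhrases.get(longPhrase);
  }

  // Keep only documents whose bitmask covers all terms required by the query.
  List<PhrasePosting> filterByMask(List<PhrasePosting> postings, int requiredMask) {
    List<PhrasePosting> out = new ArrayList<PhrasePosting>();
    for (PhrasePosting p : postings) {
      if ((p.termMask & requiredMask) == requiredMask) {
        out.add(p);
      }
    }
    return out;
  }
}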

5. Chunk-level indexing
=======================
It's basically a regular index, only we add terms with coarse-grained 
position information - instead of storing positions for every posting 
(or none) we store only "chunk" numbers, where "chunk" could be 
interpreted as a sentence (or a page, or a paragraph, or a chunk ;) ). 
From the point of view of the API this would translate to several 
postings with position increment 0, i.e. several terms would end up at 
the same position. Obviously, this lossy encoding of term proximity 
would save a lot of space and would speed up proximity query evaluation, 
at the cost of matching with a coarse "slop" - but even then we would 
know that the slop is limited to the chunk size, which is often good 
enough. Phrase/span scorers would have to understand that they are 
looking for terms that have the same (equal) "chunk" number, and score 
them accordingly (whereas the regular phrase scorer looks for postings 
with posIncr != 0, or posIncr = 1 for exact phrases).

The following paper discusses this concept in detail: 
http://www.isi.edu/~metzler/papers/elsayed-cikm11.pdf and this one 
(paywall): http://www.springerlink.com/index/T5355418276V7115.pdf
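
To make the position trick concrete, a tiny sketch of the increment 
arithmetic (plain Java, invented names; in practice this would be a 
TokenFilter writing to PositionIncrementAttribute):

import java.util.ArrayList;
import java.util.List;

// Sketch only: turn per-token chunk numbers (sentence, paragraph, ...) into
// position increments so that all tokens of one chunk land on the same
// position, and that position equals the chunk number.
class ChunkPositionAssigner {
  static class Token {
    final String term;
    final int posIncrement;

    Token(String term, int posIncrement) {
      this.term = term;
      this.posIncrement = posIncrement;
    }
  }

  // terms[i] belongs to chunk chunkIds[i] (non-decreasing, starting at 0).
  // Example: terms = [the, quick, fox, jumps], chunkIds = [0, 0, 1, 1]
  // yields increments [1, 0, 1, 0], i.e. positions [0, 0, 1, 1].
  static List<Token> assign(String[] terms, int[] chunkIds) {
    List<Token> out = new ArrayList<Token>();
    int lastChunk = -1; // so the first token of chunk 0 gets increment 1
    for (int i = 0; i < terms.length; i++) {
      int inc = chunkIds[i] - lastChunk; // 0 within a chunk, >= 1 at a boundary
      out.add(new Token(terms[i], inc));
      lastChunk = chunkIds[i];
    }
    return out;
  }
}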

6. Stored fields compaction for efficient snippet generation
============================================================
This time I have the links to the papers: 
http://www.springerlink.com/index/j2774187g532603t.pdf and 
http://www.edbt.org/Proceedings/2011-Uppsala/papers/edbt/a10-ceccarelli.pdf 
. The idea again is quite simple: instead of using the full text for 
snippet generation and highlighting, why not choose the best candidate 
snippets in advance, and store/cache/highlight only those?
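
Just to sketch the flavour of it, pre-selection at index time could look 
roughly like this - the scoring heuristic below is entirely made up, the 
papers describe much better selection criteria:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: split a document into sentences, score each one with a cheap
// significance measure (summed document-level frequency of its longer terms),
// and keep only the k best sentences for storage and later highlighting.
class SnippetSelector {
  static List<String> selectSnippets(String text, int k) {
    String[] sentences = text.split("(?<=[.!?])\\s+");

    // crude document-level term frequencies as the "significance" signal
    final Map<String, Integer> tf = new HashMap<String, Integer>();
    for (String token : text.toLowerCase().split("\\W+")) {
      if (token.length() > 3) { // skip very short / stop-like tokens
        Integer c = tf.get(token);
        tf.put(token, c == null ? 1 : c + 1);
      }
    }

    // score each sentence by the summed frequency of its terms
    final Map<String, Integer> scores = new HashMap<String, Integer>();
    for (String s : sentences) {
      int score = 0;
      for (String token : s.toLowerCase().split("\\W+")) {
        Integer c = tf.get(token);
        if (c != null) {
          score += c;
        }
      }
      scores.put(s, score);
    }

    List<String> ranked = new ArrayList<String>(Arrays.asList(sentences));
    Collections.sort(ranked, new Comparator<String>() {
      public int compare(String a, String b) {
        return scores.get(b) - scores.get(a); // highest score first
      }
    });
    return ranked.subList(0, Math.min(k, ranked.size()));
  }
}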


And finally some other odd-ball links to cool papers:

* Hauff, C. (2010). Predicting the effectiveness of queries and 
retrieval systems. 
http://eprints.eemcs.utwente.nl/17338/01/final_version_LR.pdf - 
concerning the evaluation of query complexity and routing the queries to 
indexes (or nodes) that can best answer the queries. See also 
http://www.morganclaypool.com/doi/pdf/10.2200/S00235ED1V01Y201004ICR015 
(behind a paywall). Now that we can efficiently construct subsets of 
indexes on the fly I'm really tempted to implement the tiered search 
model that I presented at Lucene Revolution, unless someone beats me to it.

* F. Claude, A. Fariña (2009). Re-Pair Compression of Inverted Lists. 
http://arxiv.org/pdf/0911.3318 - on the surface it's a wild idea that 
apparently works... it's an LZ-like compression method for postings and 
a set of algorithms for intersecting these lists without decompression.


I think that's it for now... Enjoy!

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

