lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "LucenePapers" by jpountz
Date Sun, 24 Jun 2012 12:34:57 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The "LucenePapers" page has been changed by jpountz:
http://wiki.apache.org/lucene-java/LucenePapers

New page:
= Lucene Papers =

To understand the fundamental ideas behind Lucene, you should first get familiar with InformationRetrieval.
This page tries to collect links to resources that present more advanced ideas.

== Storage ==

=== Postings list encoding ===

In addition to VInt encoding, Lucene supports (or plans to support) other postings list encoding
formats (FOR, PFOR, Simple9 ...):

 * [[http://www2008.org/papers/pdf/p387-zhangA.pdf|Performance of Compressed Inverted List
Caching in Search Engines]]. Jiangong Zhang, Xiaohui Long, Torsten Suel. (2008)
 * [[http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html|Lucene
performance with the PForDelta codec]]. Mike McCandless, Changing bits, August 2nd, 2010.
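
As a rough illustration of the VInt idea mentioned above, the following sketch writes an integer as a sequence of bytes that each carry seven payload bits, with the high bit marking that more bytes follow. It also assumes, as a simplification, that a postings list is a sorted list of document IDs stored as gaps between neighbours. The function names (`encode_vint`, `decode_vint`, `encode_postings`, `decode_postings`) are hypothetical and are not Lucene APIs; this is not how FOR, PFOR or Simple9 work.

```python
def encode_vint(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low-order bits first."""
    if value < 0:
        raise ValueError("VInt sketch only handles non-negative integers")
    out = bytearray()
    while value >= 0x80:
        # High bit set: at least one more byte follows.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # Last byte has the high bit clear.
    return bytes(out)


def decode_vint(data: bytes, pos: int = 0):
    """Decode one integer starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def encode_postings(doc_ids):
    """Assumption: doc_ids is sorted ascending; store gaps, not absolute IDs."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += encode_vint(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decode_postings(data: bytes):
    """Rebuild absolute document IDs by summing the decoded gaps."""
    doc_ids = []
    pos = 0
    previous = 0
    while pos < len(data):
        gap, pos = decode_vint(data, pos)
        previous += gap
        doc_ids.append(previous)
    return doc_ids
```

Small gaps fit in a single byte, which is why gap encoding and variable-length bytes are usually combined; the block-based formats listed above aim to decode faster than this byte-at-a-time loop.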

=== The Pulsing codec ===

The Pulsing codec is optimized for fields that have many rare terms.

 * [[http://www.jopedersen.com/Publications/cutting90optimizations.pdf|Optimizations for Dynamic
Inverted Index maintenance]]. Doug Cutting, Jan Pedersen.
 * [[http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html|Lucene's
PulsingCodec on "Primary Key" Fields]]. Mike McCandless, Changing bits, June 5th, 2010.

== Query execution ==

=== Terms dictionary ===

Lucene has a new block tree terms dictionary, inspired by burst tries.

 * [[https://issues.apache.org/jira/browse/LUCENE-3030|LUCENE-3030 Block tree terms dict &
index]].
 * [[http://www.lucidimagination.com/sites/default/files/file/LR2012/AutomatonInvasionLuceneRevolution2012.pdf|Automata
invasion]]. Robert Muir, Michael McCandless.
 * [[http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499|Burst Tries: A Fast, Efficient
Data Structure for String Keys]]. Steffen Heinz, Justin Zobel, Hugh E. Williams. (2002)

=== NumericRangeQuery ===

Lucene has an optimized range query implementation for numeric types:

 * [[http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/NumericRangeQuery.html|NumericRangeQuery]].
 * [[http://dx.doi.org/10.1016/j.cageo.2008.02.023|Generic XML-based Framework for Metadata
Portals. Computers & Geosciences 34 (12), 1947-1955]]. Schindler, U, Diepenbroek, M (2008).
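
The idea behind this kind of range query is that each number is indexed at several precisions, so a range can be covered by a few coarse prefix terms in the middle plus some fine-grained terms at the edges. The sketch below shows one way such a split could be computed for non-negative integers. It is an illustration only, not Lucene's code: `split_range`, its parameters and the `(shift, min, max)` tuples are hypothetical names chosen for this example.

```python
def split_range(lo: int, hi: int, precision_step: int = 4, bits: int = 32):
    """Split [lo, hi] into sub-ranges, each aligned to a precision level.

    Returns (shift, min_value, max_value) tuples. A tuple with shift s
    stands for terms whose lowest s bits were dropped at index time.
    Assumption: 0 <= lo <= hi < 2**bits.
    """
    if not (0 <= lo <= hi < (1 << bits)):
        raise ValueError("expected 0 <= lo <= hi < 2**bits")
    ranges = []
    shift = 0
    while True:
        diff = 1 << (shift + precision_step)
        mask = ((1 << precision_step) - 1) << shift
        low_bits = (1 << shift) - 1
        # Does either bound fail to line up with the next coarser level?
        has_lower = (lo & mask) != 0
        has_upper = (hi & mask) != mask
        next_lo = ((lo + diff) if has_lower else lo) & ~mask
        next_hi = ((hi - diff) if has_upper else hi) & ~mask
        if shift + precision_step >= bits or next_lo > next_hi:
            # Nothing left for a coarser level: emit the rest at this level.
            ranges.append((shift, lo, hi | low_bits))
            break
        if has_lower:
            ranges.append((shift, lo, lo | mask | low_bits))
        if has_upper:
            ranges.append((shift, hi & ~mask, hi | low_bits))
        lo, hi = next_lo, next_hi
        shift += precision_step
    return ranges
```

For example, `split_range(3, 300)` yields two edge ranges at full precision and one range at the next coarser level, and `split_range(0, 4095)` collapses to a single coarse prefix. A smaller `precision_step` means more indexed terms per value but fewer terms to visit per query.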

=== Automaton-based fuzzy query ===

Lucene 4.0 supports an improved fuzzy query implementation that is based on Levenshtein automata.

 * [[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652|Fast String Correction
with Levenshtein-Automata]]. Klaus Schulz, Stoyan Mihov. (2002)
 * [[http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html|Lucene's
FuzzyQuery is 100 times faster in 4.0]]. Mike McCandless, Changing bits, March 24th, 2011.
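
To give a feel for what a Levenshtein automaton does, here is a minimal sketch that simulates one lazily: each state is a row of the edit-distance table for the query term, and feeding a character produces the next row. This is neither the table-driven construction from the Schulz and Mihov paper nor Lucene's implementation. The class `LevenshteinAutomaton` and the helper `fuzzy_terms` are hypothetical names used only for this example.

```python
class LevenshteinAutomaton:
    """Accepts every string within max_edits edits of term."""

    def __init__(self, term: str, max_edits: int):
        self.term = term
        self.max_edits = max_edits

    def start(self):
        # Distance from the empty input to each prefix of term.
        return tuple(range(len(self.term) + 1))

    def step(self, state, ch: str):
        # Compute the next edit-distance row after consuming ch.
        new = [state[0] + 1]
        for i, term_ch in enumerate(self.term):
            cost = 0 if term_ch == ch else 1
            new.append(min(new[i] + 1,          # insertion
                           state[i] + cost,     # match or substitution
                           state[i + 1] + 1))   # deletion
        # Capping the values keeps the number of distinct states finite.
        cap = self.max_edits + 1
        return tuple(min(v, cap) for v in new)

    def is_match(self, state) -> bool:
        return state[-1] <= self.max_edits

    def can_match(self, state) -> bool:
        # False means no continuation of the input can ever be accepted.
        return min(state) <= self.max_edits


def fuzzy_terms(terms, query: str, max_edits: int):
    """Return the candidate terms accepted by the automaton for query."""
    automaton = LevenshteinAutomaton(query, max_edits)
    hits = []
    for candidate in terms:
        state = automaton.start()
        for ch in candidate:
            state = automaton.step(state, ch)
            if not automaton.can_match(state):
                break  # Dead state: stop reading this candidate early.
        else:
            if automaton.is_match(state):
                hits.append(candidate)
    return hits
```

The early exit on a dead state is what makes the automaton view useful: combined with a sorted terms dictionary, whole ranges of terms sharing a rejected prefix can be skipped instead of being scored one by one.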

== Misc ==

=== FST compression ===

Lucene uses FSTs (finite-state transducers) extensively, so their in-memory size matters.

 * [[http://www.cs.put.poznan.pl/dweiss/site/publications/download/fsacomp.pdf|Smaller Representation
of Finite State Automata]]. Jan Daciuk, Dawid Weiss.
