lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: what is the format of .tim and .tiq in lucene 4.0 ?
Date Fri, 16 Nov 2012 12:07:52 GMT
The format is unfortunately rather intricate ...

FST = finite state transducer (see e.g.
http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html
).  We use that to hold the terms index (*.tip), which is loaded into
RAM.

Blocks exist because we encode between 25 and 48 terms together.
Blocks are picked according to how terms share prefixes so
that we get better compression and faster lookup.  It's a variant of
a burst trie (see e.g.
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499 ).
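
To illustrate why grouping terms by shared prefix helps compression, here
is a minimal sketch in Java.  It is not Lucene's actual block encoding:
the class PrefixBlockSketch, its Block type, and the method names are all
hypothetical, and a real block stores far more than this (term statistics,
postings pointers, nested sub-blocks).  It only shows the idea of storing
a shared prefix once and keeping just the suffix of each term.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Lucene 4.0 on-disk format.
class PrefixBlockSketch {

    /** One block: the prefix shared by all terms, plus each term's suffix. */
    static final class Block {
        final String prefix;
        final List<String> suffixes;

        Block(String prefix, List<String> suffixes) {
            this.prefix = prefix;
            this.suffixes = suffixes;
        }
    }

    /**
     * Longest prefix shared by every term. The input must be non-empty and
     * sorted; in that case the prefix common to the first and last term is
     * common to all terms in between.
     */
    static String commonPrefix(List<String> sortedTerms) {
        String first = sortedTerms.get(0);
        String last = sortedTerms.get(sortedTerms.size() - 1);
        int limit = Math.min(first.length(), last.length());
        int i = 0;
        while (i < limit && first.charAt(i) == last.charAt(i)) {
            i++;
        }
        return first.substring(0, i);
    }

    /** Store the shared prefix once and only the suffix of each term. */
    static Block encode(List<String> sortedTerms) {
        String prefix = commonPrefix(sortedTerms);
        List<String> suffixes = new ArrayList<>();
        for (String term : sortedTerms) {
            suffixes.add(term.substring(prefix.length()));
        }
        return new Block(prefix, suffixes);
    }

    /** Rebuild the full terms from the prefix and the suffixes. */
    static List<String> decode(Block block) {
        List<String> terms = new ArrayList<>();
        for (String suffix : block.suffixes) {
            terms.add(block.prefix + suffix);
        }
        return terms;
    }
}
```

With terms such as "search", "searched", "searcher" and "searching", the
prefix "search" is written once and only the short suffixes remain, which
is where the space saving comes from.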

The index points to the start of blocks, so in looking up a term we
figure out from the index which block may have the term (if any), seek
there, and scan for it.
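
The lookup steps above can be sketched roughly as follows.  This is a
simplification under stated assumptions: a java.util.TreeMap keyed by the
first term of each block stands in for the FST-based terms index, and
in-memory lists stand in for the on-disk blocks.  The class
BlockLookupSketch and its methods are hypothetical names, not Lucene APIs.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; Lucene's real index is an FST, not a TreeMap.
class BlockLookupSketch {

    // Stand-in for the in-RAM terms index: first term of each block -> block.
    private final TreeMap<String, List<String>> index = new TreeMap<>();

    /** Register a non-empty, sorted block of terms under its first term. */
    void addBlock(List<String> sortedTerms) {
        index.put(sortedTerms.get(0), sortedTerms);
    }

    /** Return true if the term is present in some block. */
    boolean contains(String term) {
        // Step 1: ask the index which block may hold the term, i.e. the
        // block with the greatest first term that is <= the target.
        Map.Entry<String, List<String>> entry = index.floorEntry(term);
        if (entry == null) {
            return false; // the term sorts before every block
        }
        // Step 2: "seek" to that block and scan it in sorted order.
        for (String candidate : entry.getValue()) {
            int cmp = candidate.compareTo(term);
            if (cmp == 0) {
                return true;
            }
            if (cmp > 0) {
                break; // passed the position where the term would be
            }
        }
        return false;
    }
}
```

The point of the design is that only the small index needs to live in RAM;
a lookup touches at most one block, and a miss can often be rejected from
the index alone without reading any block.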

Mike McCandless

http://blog.mikemccandless.com

On Fri, Nov 16, 2012 at 3:57 AM, wgggfiy <wuqiu.reg@qq.com> wrote:
> Hi, guys. I'm now studying Lucene 4.0 and have run into difficulties. The
> term dictionary is different from the one in previous versions. What is a
> block? And what is the FST? Help me, thanks.
