lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: FST and FieldCache?
Date Thu, 19 May 2011 10:09:17 GMT
We should add this (lookup by value, when value is guaranteed to
monotonically increase as the key increases) to our core FST APIs?
It's generically useful in many places ;)  I'll open an issue.

EG this would also enable an FST terms index that supports
lookup-by-ord, something VariableGapTermsIndex (this is the one that
uses FST for the index) does not support today.

David, one thing to remember is trunk has already seen drastic
reductions on the RAM required to store DocTerms/Index vs 3.x
(something maybe we should backport to 3.x...).  The bytes for the
terms are now stored as shared byte[] blocks, and the ords/offsets are
stored as packed ints, so we no longer have per-String memory&
pointer overhead.  I describe the gains here:
http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html
-- though, those gains include RAM reduction from the terms index as
well.

While FST w/ lookup-by-monotonic-value would work here, I would be
worried about the per hit of that representation vs what
DocTerms/Index offers today... we should test to see.  Of course, for
certain apps that perf hit is justified, so probably we should make
this an option when populating field cache (ie, in-memory storage
option of using an FST vs using packed ints/byte[]).

Mike

http://blog.mikemccandless.com

On Thu, May 19, 2011 at 4:43 AM, Dawid Weiss
<dawid.weiss@cs.put.poznan.pl> wrote:
>> Though, it's possible to do if you associate an additional number with
>> each node. (I invented some way, shared it with Mike and forgot.. good
>> grief :/)
>
> It doesn't need to be invented -- it's a known technique. On each arc
> you store the number of strings under that arc; while traversing you
> accumulate -- this gives you a unique number for each string (perfect
> hash) and a way to locate a string given its number.
>
>> And it can't do the reverse lookup, by design, that's a lossy
>> compression for all good perfect hashing algos.
>> So, it's irrelevant here, huh?
>
> You can do both the way I described above. Jan Daciuk has details on
> many more variants of doing that:
>
> Jan Daciuk, Rafael C. Carrasco, Perfect Hashing with Pseudo-minimal
> Bottom-up Deterministic Tree Automata, Intelligent Information Systems
> XVI, Proceedings of the International IIS'08 Conference held in
> Zakopane, Poland, June 16-18, 2008, Mieczysław A. Kłopotek, Adam
> Przepiórkowski, Sławomir T. Wierzchoń, Krzysztof Trojanowski (eds.),
> Academic Publishing House Exit, Warszawa 2008.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message