lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David Smiley (" <>
Subject Re: FST and FieldCache?
Date Thu, 19 May 2011 15:58:49 GMT
Wow, 17 replies to my email overnight! This is clearly an interesting topic
to folks.

Hi Dawid.
Sadly, I won't be at Lucene Revolution next week. That's where all the cool
kids will be; I'll be home and be square. I made it to O'Reilly Strata in
February (a great conference) and I'll be presenting at Basis's "Open Source
Search Conference" (government customer focused) mid-June.  I've used up my
conference budget for this fiscal year.

Yes, the use-case here is a unique integer reference to a String that can be
looked up fairly quickly, whereas the set of all strings are in a compressed
data structure that won't change after its built. A bonus benefit would be
that this integer is a sortable substitute for the String.  Your observation
of this integer being a perfect-hash is astute.

I wonder if Lucene could store this FST on-disk for the bytes in a segment
instead of what it's doing now? Read-time construction would be super-fast,
though for multi-segment indexes, I suppose they'd need to be merged.

I expect that this use-case would be particularly useful for cases when you
know that the set of strings tends to have a great deal of prefixes in
common, such as when EdgeNGramming (applications: query-complete,
hierarchical faceting, prefix/tree based geospatial indexing).

~ David Smiley

Dawid Weiss wrote:
> Hi David,
>> but with less memory.  As I understand it, FSTs are a highly compressed
>> representation of a set of Strings (among other possibilities).  The
> Yep. Not only, but this is one of the use cases. Will you be at Lucene
> Revolution next week? I'll be talking about it there.
>> representation of a set of Strings (among other possibilities).  The
>> fieldCache would need to point to an FST entry (an "arc"?) using
>> something
>> small, say an integer.  Is there a way to point to an FST entry with an
>> integer, and then somehow with relative efficiency construct the String
>> from
>> the arcs to get there?
> Correct me if my understanding is wrong: you'd like to assign a unique
> integer to each String and then retrieve it by this integer (something
> like a
> Map&lt;Integer, String&gt;)? This would be something called perfect
> hashing
> and this can be done on top of an automaton (fairly easily). I assume
> the data structure is immutable once constructed and does not change
> too often, right?
> Dawid
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

View this message in context:
Sent from the Lucene - Java Developer mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message