lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Carlos González-Cadenas (Commented) (JIRA) <>
Subject [jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB
Date Mon, 05 Dec 2011 12:47:40 GMT


Carlos González-Cadenas commented on LUCENE-3298:

Yeap, at the beginning of this project we tried to implement this autocomplete system using
regular inverted indexes, but the response time required for autocomplete to work from a user
perspective is very low (<50ms), and it would be quite hard to achieve such a performance
with inverted indexes. 

I still think this is the way to go, but as you say we have to be careful with the data generation
part. Most of the work should be put in making sure that the data is well distributed and
organized in order to avoid combinatorial explosion.

Let me go in detail with the sources of data permutations and the reasoning behind them:

1) With regards to infix matches, if a user types "barcelona" we want to match "hotels in
barcelona". In order to achieve this, we generate:

hotels in barcelona => hotels in barcelona
in barcelona => hotels in barcelona
barcelona => hotels in barcelona

The FST should be able to conflate these prefixes nicely in just one path, right?. Therefore
this part shouldn't be a problem.

2) In addition, another feature we want to achieve is to be able to match inputs without prepositions.
That means that if the user types "hotels barcelona jacuzzi", we should be able to match "hoteles
in barcelona with jacuzzi". Now the only way we envision of doing it properly is to generate
this permutation within the data:

hotels barcelona jacuzzi => hotels in barcelona with jacuzzi

I can see how this can explode the FST by creating different branches. Theoretically this
could be done at runtime without the need of generating the data, but we don't see a way to
do it in a clean way. To make things more complicated :) we've implemented fuzzy matching
at query time (we use a levenshtein automata generated with the user input + an edit distance
and then we intersect with the FST), and this is making very complicated to do preposition
handling at query time. 

3) PP permutations (i.e. hoteles in barcelona with jacuzzi and hoteles with jacuzzi in barcelona).
I don't really see a way to work around this. Probably we need to be careful and only generate
these permutations for the top-K cities, in order to limit the potential size.

Summarizing, I believe that we can reduce the set of "bad permutations" a lot if we can figure
out how to implement the prepositions at runtime. If you have any ideas, let me know. Thanks!

> FST has hard limit max size of 2.1 GB
> -------------------------------------
>                 Key: LUCENE-3298
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-3298.patch
> The FST uses a single contiguous byte[] under the hood, which in java is indexed by int
so we cannot grow this over Integer.MAX_VALUE.  It also internally encodes references to this
array as vInt.
> We could switch this to a paged byte[] and make the far larger.
> But I think this is low priority... I'm not going to work on it any time soon.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:!default.jspa
For more information on JIRA, see:


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message