lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Carlos González-Cadenas (Commented) (JIRA) <>
Subject [jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB
Date Sat, 03 Dec 2011 15:32:40 GMT


Carlos González-Cadenas commented on LUCENE-3298:

Hello Dawid:

First of all, for the FSTLookup we're using a FST and not a FSA, that means that we've replaced
the NoOutputs with ByteSequenceOutputs. The reason for this is that our autocomplete system
requires to get some metadata about the sentence being matched, because we do some post-match
reranking of the results in order to get quality suggestions (the buckets themselves are quite
big and the results are quite poor without reranking, specially for short prefixes)

The queries we use with the FSTLookup are generated using templates. Our domain is hotels,
simplifying things that means that we have K templates (i.e. an example template could be
hotels + amenity + city), then we have I individuals (i.e. all the cities in the world ),
and we generate queries like:

hotels with jacuzzi in barcelona

The output for this key contains some metadata that is useful for reranking matches or displaying
the matches to the user. We encode it in as few bytes as possible, and we make sure that things
that are common to many sentences are put first so that they can get compacted appropriately.
The metadata in the output is the following (in order):

1) Display version of the sentence. The match sentence described above is normalized and transliterated
to ASCII, i.e. it converts acutes to normal characters, and stuff like this. We have the "display"
version to show the correctly formatted version to the user. We apply simple transliteration
rules so that the differnece between the display and the match is not very big (i.e. , we
don't apply stuff like pinyin transliteration for chinese).
2) The ID of the city (an int that will help us to later execute the query and give the right
3) The "score" of the sentence (float, used to rerank the matches)
4) A byte to represent the number of alternate names used in the query (i.e. "barcelona" is
also referred as "la ciudad condal", if the prefix is "hotels in" we don't want to produce
both matches, just the most common one. This value is used for reranking.
5) The number of skipped tokens (we use this for infix matches). This is another byte. This
is used for reranking, to give priority to prefix queries (skipped tokens = 0) over infix
queries, and if all the matches are infix, to give the ones where the matched word is closer
to the beginning of the sentence).

The sentences are quite repetitive, but we don't seem to achieve an optimal compression. A
GZ-compressed file of 250MB is generating a FST of around 1.9GB. 

We've applied the patch to trunk, but I cannot get it to compile. Maybe the version that is
uploaded to JIRA is out of date. Do you have a newer version?

Thanks a lot,

> FST has hard limit max size of 2.1 GB
> -------------------------------------
>                 Key: LUCENE-3298
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-3298.patch
> The FST uses a single contiguous byte[] under the hood, which in java is indexed by int
so we cannot grow this over Integer.MAX_VALUE.  It also internally encodes references to this
array as vInt.
> We could switch this to a paged byte[] and make the far larger.
> But I think this is low priority... I'm not going to work on it any time soon.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:!default.jspa
For more information on JIRA, see:


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message