lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dawid Weiss (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB
Date Sat, 03 Dec 2011 16:50:39 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162177#comment-13162177
] 

Dawid Weiss commented on LUCENE-3298:
-------------------------------------

The first thing that comes to my mind is to use an int or long file offset as the output and
store all other outputs in another file. This will help you keep the FST smaller and thus
store more arcs inside.

So, I don't count the outputs for now. The example you've given is too simple to say anything
-- this should conflate prefixes nicely. Although if you have lots of combinations of different
prefixes (say, "motels with jacuzzi in [barcelona|madrid|berlin]" then I can see how the automaton
can explode there on internal nodes (if your input "suggestions" are very long and have lots
of infix variants). I don't see any direct solution to this.

There is a data structure called LZTrie which does infix compression (or rather: it attempts
to store similar subtrees in the automaton only once, replacing their duplicated occurrences
with a pointer). That data structure is quite difficult to implement efficiently (construction)
and the traversals are not that straightforward either. I don't have a working implementation
unfortunately but if you have time (and would like to contribute!) then its description is
here:

http://www.irb.hr/hr/home/ristov/papers/RistovLZtrieRevision1.pdf

That patch wasn't mine, by the way, so I don't have any newer one either.


                
> FST has hard limit max size of 2.1 GB
> -------------------------------------
>
>                 Key: LUCENE-3298
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3298
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-3298.patch
>
>
> The FST uses a single contiguous byte[] under the hood, which in java is indexed by int
so we cannot grow this over Integer.MAX_VALUE.  It also internally encodes references to this
array as vInt.
> We could switch this to a paged byte[] and make the far larger.
> But I think this is low priority... I'm not going to work on it any time soon.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message