lucene-java-user mailing list archives

From Michael McCandless <>
Subject Re: Autocompletion on large index
Date Wed, 06 Jul 2011 18:39:33 GMT
Hmm... so I suspect the fst suggest module must first gather up all
titles, then sort them, in RAM, and then build the actual FST.  Maybe
it's this gather + sort that's taking so much RAM?

1.3 M publications times 100 chars times 2 bytes/char = ~248 MB.  So
that shouldn't be it...
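
As a rough check of the estimate above, here is a minimal sketch of the arithmetic. It assumes the titles are held as plain UTF-16 Java chars (2 bytes each) and ignores per-object overhead; the class and method names are made up for illustration and are not part of Lucene.

```java
// Back-of-envelope estimate of the raw character data for the titles.
// Assumption: 2 bytes per char (UTF-16), no String/array object overhead.
final class TitleMemoryEstimate {

    /** Bytes needed for the raw characters only. */
    static long rawCharBytes(long titles, long charsPerTitle) {
        return titles * charsPerTitle * 2L;
    }

    /** Converts a byte count to mebibytes (1 MiB = 1024 * 1024 bytes). */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long bytes = rawCharBytes(1300000L, 100L);
        // prints "260000000 bytes = 248 MiB"
        System.out.printf("%d bytes = %.0f MiB%n", bytes, toMiB(bytes));
    }
}
```

So the figure is 260 MB in decimal units, or about 248 MiB. Real heap usage would be somewhat higher because of object headers, but still far below 2.5GB.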

Is this an accessible corpus?  Can I somehow get a copy to play with...?

Are you able to [temporarily, once] build the full FST and other
suggest impls and compare how much RAM is required for building and
then lookups?
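
One rough way to compare the retained footprint of each impl after building is a small heap probe like the one below. This is only a sketch: `HeapProbe` and `measure` are hypothetical names (not part of Lucene), the numbers depend on `System.gc()` actually collecting, and it reports what stays on the heap after the build, not the peak during the build.

```java
import java.util.function.Supplier;

// Hypothetical helper: approximates how much heap an object retains
// by comparing used heap before and after it is built.
final class HeapProbe {

    /** The built object plus the approximate heap growth in bytes. */
    static final class Measurement<T> {
        final T value;
        final long deltaBytes;

        Measurement(T value, long deltaBytes) {
            this.value = value;
            this.deltaBytes = deltaBytes;
        }
    }

    /** Currently used heap, as reported by the JVM. */
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs the builder and reports the change in used heap around it. */
    static <T> Measurement<T> measure(Supplier<T> builder) {
        System.gc(); // only a hint to the JVM; results are approximate
        long before = usedHeap();
        T value = builder.get();
        System.gc();
        long after = usedHeap();
        return new Measurement<T>(value, after - before);
    }

    public static void main(String[] args) {
        // Stand-in workload; in practice the supplier would build one suggest impl.
        Measurement<int[]> m = measure(() -> new int[1000000]);
        System.out.println("retained approx. " + m.deltaBytes + " bytes");
    }
}
```

For the build-time peak, lowering the heap limit until the build fails, or watching the process with a profiler, would probably tell more than this probe.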

Mike McCandless

On Wed, Jul 6, 2011 at 1:50 PM, Elmer <> wrote:
> Hi Mike,
> That's what I thought when I started indexing it. To be clear, it happens at
> build time.
> I don't know if memory efficiency is better when building has finished.
> The titles I index are titles from the dblp computer science bibliography.
> They can take up to... say 100 characters.
> Examples:
> -------
> - Auditory stimulus optimization with feedback from fuzzy clustering of
> neuronal responses
> - Two-objective method for crisp and fuzzy interval comparison in
> optimization
> - Bound Constrained Smooth Optimization for Solving Variational Inequalities
> and Related Problems
> - Retrieval of bibliographic records using Apache Lucene
> - Digital Library Information Appliances
> -------
> The "title_suggest" field uses the KeywordTokenizer and LowerCaseFilter in
> that order.
> I also tried to do the same for the author names, and this works without
> problems. Actually it builds the tree/fsa/... faster from dictionary than
> from file (the lookup data file that can be stored and loaded through the
> .store and .load methods). But the larger set of publication titles is
> currently a no-go with 2.5GB of heap space, with nothing running but a main
> class that builds the Lookup data.
> BR,
> Elmer
> -----Original Message-----
> From: Michael McCandless
> Sent: Wednesday, July 06, 2011 6:23 PM
> To:
> Subject: Re: Autocompletion on large index
> You could try storing your autocomplete index in a RAMDirectory?
> But: I'm surprised you see the FST suggest impl using up so much RAM;
> very low memory usage is one of the strengths of the FST approach.
> Can you share the text (titles) you are feeding to the suggest module?
> Mike McCandless
> On Wed, Jul 6, 2011 at 12:08 PM, Elmer <> wrote:
>> Hi again.
>> I have created my own autocompleter based on the spellchecker. This
>> works well in a sense that it is able to create an auto completion index
>> from my 'publication' index. However, integrated in my web application,
>> each keypress asks autocompleter to search the index, which is stored on
>> disk (not in mem), just like spellchecker does (except that spellchecker
>> is not invoked every keypress).
>> With Lucene 3.3.0, auto completion modules are included, which load
>> their trees/fsa/... in memory. I'd like to use these modules, but the
>> problem is that they use more than 2.5GB, causing heap space exceptions.
>> This happens when I try to build a Lookup index (fst, jaspell or tst,
>> doesn't matter) from my 'publication' index consisting of 1.3M
>> publications. The field I use for autocompletion holds the titles of the
>> publications indexed untokenized (but lowercased).
>> Code:
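>> Lookup autoCompleter = new TSTLookup();
>> // Open the existing 'publication' index from disk
>> FSDirectory dir = FSDirectory.open(new File("PATHTOINDEX"));
>> // Use the terms of the "title_suggest" field as the dictionary
>> LuceneDictionary dict = new
>> LuceneDictionary(IndexReader.open(dir), "title_suggest");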
>> Is it possible to have the autocompletion module work in-memory on
>> such a dataset without increasing Java's heap space?
>> FTR, the 3.3.0 autocompletion modules use more than 2.5GB of RAM, where
>> my own autocompleter index is stored on disk using about 300MB.
>> BR,
>> Elmer
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail: