lucene-java-user mailing list archives

From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: Autocompletion on large index
Date Thu, 07 Jul 2011 09:09:42 GMT
Elmer, TST will have a large overhead. FST may not be that much better if
your input has very few shared prefixes or suffixes. In your case I think this
is unfortunately true. What I would do is create a regular Lucene index and
store it on disk, then run prefix queries on it. That should work and scale to
a large number of ops per second. See the Lucene Revolution 2011 talks: there
was a talk about using just this instead of a completion module.
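
As a rough illustration of why prefix lookups over sorted terms are cheap, here is a minimal sketch. It does not use Lucene: a sorted in-memory set stands in for an index's sorted term dictionary, and the names PrefixLookupSketch, add and complete are made up for this example. With Lucene itself one would presumably run a PrefixQuery on the "title_suggest" field of the on-disk index instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

public class PrefixLookupSketch {
    // Assumption: a sorted set stands in for a sorted term dictionary.
    private final TreeSet<String> terms = new TreeSet<>();

    // Store the whole title as one lowercased term (untokenized).
    public void add(String title) {
        terms.add(title.toLowerCase(Locale.ROOT));
    }

    // Return up to `limit` terms that start with `prefix`, in sorted order.
    public List<String> complete(String prefix, int limit) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        // Seek to the first term >= prefix; all matches are contiguous from there.
        for (String term : terms.tailSet(p, true)) {
            if (!term.startsWith(p) || out.size() >= limit) {
                break;
            }
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        PrefixLookupSketch index = new PrefixLookupSketch();
        index.add("Digital Library Information Appliances");
        index.add("Retrieval of bibliographic records using Apache Lucene");
        System.out.println(index.complete("dig", 5));
    }
}
```

Each lookup is one seek plus a short scan, so the cost per keypress depends on the number of suggestions returned rather than on the size of the term set.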

Like Mike said though, it would be interesting to investigate this on your data.
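
One way to investigate how much prefix sharing a data set has is to count the nodes of a plain character trie and compare that with the total number of characters. The sketch below is only an assumption about how one might measure this; the class PrefixSharingSketch and the method measure are hypothetical, and it looks at shared prefixes only, not at the shared suffixes an FST can also exploit.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PrefixSharingSketch {
    // One trie node per distinct character position along a prefix path.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
    }

    // Returns {trieNodes, totalChars}; the root node is not counted.
    public static long[] measure(List<String> keys) {
        Node root = new Node();
        long nodes = 0;
        long chars = 0;
        for (String key : keys) {
            Node current = root;
            for (int i = 0; i < key.length(); i++) {
                chars++;
                char c = key.charAt(i);
                Node next = current.children.get(c);
                if (next == null) {
                    // No existing key shares this prefix, so a new node is needed.
                    next = new Node();
                    current.children.put(c, next);
                    nodes++;
                }
                current = next;
            }
        }
        return new long[] {nodes, chars};
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                "digital library", "digital libraries", "retrieval");
        long[] result = measure(keys);
        System.out.printf("nodes=%d chars=%d ratio=%.2f%n",
                result[0], result[1], (double) result[0] / result[1]);
    }
}
```

A ratio close to 1.0 means almost every character needs its own node, which is the case where trie-like structures save little over storing the raw strings.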
On Jul 6, 2011 8:52 PM, "Elmer" <evanchastelet@gmail.com> wrote:
> I just profiled the application and tst.TernaryTreeNode takes 99.99..% of
> the memory.
>
> I'll test further tomorrow and report on mem usage for runnable smaller
> indexes.
> I will email you privately for sharing the index to work with.
>
> BR,
> Elmer
>
>
> -----Original message-----
> From: Michael McCandless
> Sent: Wednesday, July 06, 2011 8:39 PM
> To: java-user@lucene.apache.org
> Subject: Re: Autocompletion on large index
>
> Hmm... so I suspect the fst suggest module must first gather up all
> titles, then sort them, in RAM, and then build the actual FST. Maybe
> it's this gather + sort that's taking so much RAM?
>
> 1.3 M publications times 100 chars times 2 bytes/char = 260,000,000
> bytes (~248 MiB). So that shouldn't be it...
>
> Is this an accessible corpus? Can I somehow get a copy to play with...?
>
> Are you able to [temporarily, once] build the full FST and other
> suggest impls and compare how much RAM is required for building and
> then lookups?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Jul 6, 2011 at 1:50 PM, Elmer <evanchastelet@gmail.com> wrote:
>> Hi Mike,
>>
>> That's what I thought when I started indexing it. To be clear, it happens
>> at build time.
>> I don't know if memory efficiency is better when building has finished.
>>
>> The titles I index are titles from the DBLP computer science bibliography.
>> They can take up to... say 100 characters.
>> Examples:
>> -------
>> - Auditory stimulus optimization with feedback from fuzzy clustering of
>> neuronal responses
>> - Two-objective method for crisp and fuzzy interval comparison in
>> optimization
>> - Bound Constrained Smooth Optimization for Solving Variational
>> Inequalities
>> and Related Problems
>> - Retrieval of bibliographic records using Apache Lucene
>> - Digital Library Information Appliances
>> -------
>>
>> The "title_suggest" field uses the KeywordTokenizer and LowerCaseFilter, in
>> that order.
>>
>> I also tried to do the same for the author names, and this works without
>> problems. Actually it builds the tree/fsa/... faster from dictionary than
>> from file (the lookup data file that can be stored and loaded through the
>> .store and .load methods). But the larger set of publication titles is
>> currently a no-go with 2.5GB of heap space, even with only a main class
>> that builds the Lookup data.
>>
>> BR,
>> Elmer
>>
>>
>> -----Original message-----
>> From: Michael McCandless
>> Sent: Wednesday, July 06, 2011 6:23 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Autocompletion on large index
>>
>> You could try storing your autocomplete index in a RAMDirectory?
>>
>> But: I'm surprised you see the FST suggest impl using up so much RAM;
>> very low memory usage is one of the strengths of the FST approach.
>> Can you share the text (titles) you are feeding to the suggest module?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Wed, Jul 6, 2011 at 12:08 PM, Elmer <evanchastelet@gmail.com> wrote:
>>>
>>> Hi again.
>>>
>>> I have created my own autocompleter based on the spellchecker. This
>>> works well in a sense that it is able to create an auto completion index
>>> from my 'publication' index. However, integrated in my web application,
>>> each keypress asks autocompleter to search the index, which is stored on
>>> disk (not in mem), just like spellchecker does (except that spellchecker
>>> is not invoked every keypress).
>>> With Lucene 3.3.0, auto completion modules are included, which load
>>> their trees/fsa/... in memory. I'd like to use these modules, but the
>>> problem is that they use more than 2.5GB, causing heap space exceptions.
>>> This happens when I try to build a Lookup index (FST, Jaspell or TST,
>>> doesn't matter) from my 'publication' index consisting of 1.3M
>>> publications. The field I use for autocompletion holds the titles of the
>>> publications indexed untokenized (but lowercased).
>>>
>>> Code:
>>> Lookup autoCompleter = new TSTLookup();
>>> FSDirectory dir = FSDirectory.open(new File("PATHTOINDEX"));
>>> LuceneDictionary dict = new LuceneDictionary(IndexReader.open(dir), "title_suggest");
>>> autoCompleter.build(dict);
>>>
>>> Is it possible to have the autocompletion module work in-memory on
>>> such a dataset without increasing Java's heap space?
>>> FTR, the 3.3.0 autocompletion modules use more than 2.5GB of RAM, where
>>> my own autocompleter index is stored on disk using about 300MB.
>>>
>>> BR,
>>> Elmer
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
