Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of dawid.weiss@gmail.com
 designates 209.85.218.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <EB02D4123D664693B338F5CBBA8772C0@ElmerPC>
References: <1309968498.25963.17.camel@elmer-P35-DS3P>
	<CAL8PwkaK18wqCvAmvg9L3tS1-hxX5KKcf1Buo2n45256AzSgBg@mail.gmail.com>
	<5D6C36CFCB0B4AF38BEAC2E0240D22E5@ElmerPC>
	<CAL8Pwka3p=-WBQ+UUQzTbvCypgGHi03gKzUHp-v_BjWUzz6=yw@mail.gmail.com>
	<EB02D4123D664693B338F5CBBA8772C0@ElmerPC>
Date: Thu, 7 Jul 2011 11:09:42 +0200
Message-ID: 
 <CAM21Rt_T+QZyC4bH4+oE7+kcttpyny6dFpmGVZ3mBAGkZfBBnQ@mail.gmail.com>
Subject: Re: Autocompletion on large index
From: Dawid Weiss <dawid.weiss@gmail.com>
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=001485f87cf6d9862004a777139e

--001485f87cf6d9862004a777139e
Content-Type: text/plain; charset=UTF-8

Elmer, TST will have a large overhead. FST may not be that much better if
your input has very few shared prefixes or suffixes. In your case I think this
is unfortunately true. What I would do is create a regular Lucene index and
store it on disk, then run prefix queries on it. This should work and scale to
a large number of ops per second. See the Lucene Revolution 2011 talks: there
was a talk about using just this instead of a completion module.
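
To illustrate why prefix queries stay cheap: the terms of a field are kept in
sorted order, so all completions of a prefix sit next to each other and can be
read with one seek plus a short scan. Below is a minimal sketch of that idea
using only java.util, not Lucene's API (in Lucene itself the corresponding query
class is PrefixQuery). The class name PrefixScanSketch, the complete() helper
and the "retrieval models" title are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only: a sorted in-memory set stands in for the
// sorted term dictionary of an on-disk index.
class PrefixScanSketch {

    // Returns up to `limit` terms that start with `prefix`, in sorted order.
    static List<String> complete(TreeSet<String> terms, String prefix, int limit) {
        List<String> out = new ArrayList<String>();
        // tailSet seeks to the first term >= prefix; every term that starts
        // with the prefix follows contiguously from that position.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || out.size() >= limit) {
                break; // left the matching range, or collected enough
            }
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<String>();
        terms.add("digital library information appliances");
        terms.add("retrieval of bibliographic records using apache lucene");
        terms.add("retrieval models"); // made-up title, not from the corpus
        System.out.println(complete(terms, "retr", 10));
    }
}
```

The cost per keypress depends on the number of suggestions returned, not on the
1.3M titles in the index, which is why this should hold up under load.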

Like Mike said though, it'd be interesting to investigate on your data.
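
One rough way to start that investigation is to count how many nodes a plain
character trie needs for the titles and compare that with the total number of
characters. If the two numbers are close, the titles share few prefixes and a
tree-based lookup cannot compress much. This is only a sketch under stated
assumptions: the class PrefixSharingSketch and its methods are made up, the
trie is a simple HashMap-based one rather than the TST or FST implementation,
and it says nothing about suffix sharing, which an FST also exploits.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for estimating prefix sharing in a set of titles.
class PrefixSharingSketch {

    static final class Node {
        final Map<Character, Node> children = new HashMap<Character, Node>();
    }

    // Inserts every title into a character trie and returns the number of
    // nodes created (the root is not counted).
    static long countTrieNodes(Iterable<String> titles) {
        Node root = new Node();
        long nodes = 0;
        for (String title : titles) {
            Node cur = root;
            for (int i = 0; i < title.length(); i++) {
                char c = title.charAt(i);
                Node next = cur.children.get(c);
                if (next == null) {
                    next = new Node();
                    cur.children.put(c, next);
                    nodes++; // a new node means this prefix was not shared
                }
                cur = next;
            }
        }
        return nodes;
    }

    // Total number of characters over all titles.
    static long totalChars(Iterable<String> titles) {
        long chars = 0;
        for (String title : titles) {
            chars += title.length();
        }
        return chars;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "digital library information appliances",
                "digital libraries", // made-up title, not from the corpus
                "retrieval of bibliographic records using apache lucene");
        long nodes = countTrieNodes(titles);
        long chars = totalChars(titles);
        System.out.println(nodes + " trie nodes for " + chars + " characters");
    }
}
```

Run over a sample of the real titles, a ratio of nodes to characters close to 1
would support the guess that long, mostly unique titles are the problem.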
On Jul 6, 2011 8:52 PM, "Elmer" <evanchastelet@gmail.com> wrote:
> I just profiled the application and tst.TernaryTreeNode takes 99.99..% of
> the memory.
>
> I'll test further tomorrow and report on mem usage for runnable smaller
> indexes.
> I will email you privately for sharing the index to work with.
>
> BR,
> Elmer
>
>
> -----Original message-----
> From: Michael McCandless
> Sent: Wednesday, July 06, 2011 8:39 PM
> To: java-user@lucene.apache.org
> Subject: Re: Autocompletion on large index
>
> Hmm... so I suspect the fst suggest module must first gather up all
> titles, then sort them, in RAM, and then build the actual FST. Maybe
> it's this gather + sort that's taking so much RAM?
>
> 1.3 M publications times 100 chars times 2 bytes/char = ~248 MB. So
> that shouldn't be it...
>
> Is this an accessible corpus? Can I somehow get a copy to play with...?
>
> Are you able to [temporarily, once] build the full FST and other
> suggest impls and compare how much RAM is required for building and
> then lookups?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Jul 6, 2011 at 1:50 PM, Elmer <evanchastelet@gmail.com> wrote:
>> Hi Mike,
>>
>> That's what I thought when I started indexing it. To be clear, it happens
>> at build time.
>> I don't know if memory efficiency is better when building has finished.
>>
>> The titles I index are titles from the dblp computer science bibliography.
>> They can take up to... say 100 characters.
>> Examples:
>> -------
>> - Auditory stimulus optimization with feedback from fuzzy clustering of
>> neuronal responses
>> - Two-objective method for crisp and fuzzy interval comparison in
>> optimization
>> - Bound Constrained Smooth Optimization for Solving Variational
>> Inequalities
>> and Related Problems
>> - Retrieval of bibliographic records using Apache Lucene
>> - Digital Library Information Appliances
>> -------
>>
>> The "title_suggest" field uses the KeywordTokenizer and LowerCaseFilter,
>> in that order.
>>
>> I also tried to do the same for the author names, and this works without
>> problems. Actually it builds the tree/fsa/... faster from dictionary than
>> from file (the lookup data file that can be stored and loaded through the
>> .store and .load methods). But the larger set of publication titles is
>> currently a no-go with 2.5GB of heap space, even with only a main class
>> that builds the Lookup data.
>>
>> BR,
>> Elmer
>>
>>
>> -----Original message-----
>> From: Michael McCandless
>> Sent: Wednesday, July 06, 2011 6:23 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Autocompletion on large index
>>
>> You could try storing your autocomplete index in a RAMDirectory?
>>
>> But: I'm surprised you see the FST suggest impl using up so much RAM;
>> very low memory usage is one of the strengths of the FST approach.
>> Can you share the text (titles) you are feeding to the suggest module?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Wed, Jul 6, 2011 at 12:08 PM, Elmer <evanchastelet@gmail.com> wrote:
>>>
>>> Hi again.
>>>
>>> I have created my own autocompleter based on the spellchecker. This
>>> works well in a sense that it is able to create an auto completion index
>>> from my 'publication' index. However, integrated in my web application,
>>> each keypress asks autocompleter to search the index, which is stored on
>>> disk (not in mem), just like spellchecker does (except that spellchecker
>>> is not invoked every keypress).
>>> With Lucene 3.3.0, auto completion modules are included, which load
>>> their trees/fsa/... in memory. I'd like to use these modules, but the
>>> problem is that they use more than 2.5GB, causing heap space exceptions.
>>> This happens when I try to build a Lookup index (fst, jaspell or tst,
>>> doesn't matter) from my 'publication' index consisting of 1.3M
>>> publications. The field I use for autocompletion holds the titles of the
>>> publications indexed untokenized (but lowercased).
>>>
>>> Code:
>>> Lookup autoCompleter = new TSTLookup();
>>> FSDirectory dir = FSDirectory.open(new File("PATHTOINDEX"));
>>> LuceneDictionary dict = new LuceneDictionary(IndexReader.open(dir), "title_suggest");
>>> autoCompleter.build(dict);
>>>
>>> Is it possible to have the autocompletion module to work in-memory on
>>> such a dataset without increasing java's heapspace?
>>> FTR, the 3.3.0 autocompletion modules use more than 2.5GB of RAM, where
>>> my own autocompleter index is stored on disk using about 300MB.
>>>
>>> BR,
>>> Elmer
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>

--001485f87cf6d9862004a777139e--