References: <1309968498.25963.17.camel@elmer-P35-DS3P> <5D6C36CFCB0B4AF38BEAC2E0240D22E5@ElmerPC>
Date: Thu, 7 Jul 2011 11:09:42 +0200
Subject: Re: Autocompletion on large index
From: Dawid Weiss
To: java-user@lucene.apache.org

Elmer.

TST will have a large overhead. FST may not be that much better if your
input has very few shared prefixes or suffixes. In your case I think this
is unfortunately true.

What I would do is create a regular Lucene index and store it on disk,
then run prefix queries on it. That should work and scale to a large
number of operations per second. See the Lucene Revolution 2011 talks:
there was a talk about using just this instead of a completion module.

Like Mike said though, it'd be interesting to investigate on your data.

On Jul 6, 2011 8:52 PM, "Elmer" wrote:
> I just profiled the application and tst.TernaryTreeNode takes 99.99..% of
> the memory.
>
> I'll test further tomorrow and report on memory usage for smaller indexes
> that do run.
> I will email you privately about sharing the index to work with.
>
> BR,
> Elmer
>
>
> -----Original Message-----
> From: Michael McCandless
> Sent: Wednesday, July 06, 2011 8:39 PM
> To: java-user@lucene.apache.org
> Subject: Re: Autocompletion on large index
>
> Hmm... so I suspect the FST suggest module must first gather up all
> titles, then sort them, in RAM, and then build the actual FST. Maybe
> it's this gather + sort that's taking so much RAM?
>
> 1.3 M publications times 100 chars times 2 bytes/char = ~248 MB. So
> that shouldn't be it...
>
> Is this an accessible corpus? Can I somehow get a copy to play with...?
>
> Are you able to [temporarily, once] build the full FST and other
> suggest impls and compare how much RAM is required for building and
> then lookups?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Jul 6, 2011 at 1:50 PM, Elmer wrote:
>> Hi Mike,
>>
>> That's what I thought when I started indexing it. To be clear, it
>> happens at build time.
>> I don't know if memory efficiency is better when building has finished.
>>
>> The titles I index are titles from the DBLP Computer Science
>> Bibliography. They can take up to... say 100 characters.
>> Examples:
>> -------
>> - Auditory stimulus optimization with feedback from fuzzy clustering of
>>   neuronal responses
>> - Two-objective method for crisp and fuzzy interval comparison in
>>   optimization
>> - Bound Constrained Smooth Optimization for Solving Variational
>>   Inequalities and Related Problems
>> - Retrieval of bibliographic records using Apache Lucene
>> - Digital Library Information Appliances
>> -------
>>
>> The "title_suggest" field uses the KeywordTokenizer and LowerCaseFilter,
>> in that order.
>>
>> I also tried to do the same for the author names, and this works without
>> problems. Actually it builds the tree/fsa/... faster from dictionary than
>> from file (the lookup data file that can be stored and loaded through the
>> .store and .load methods). But the larger set of publication titles is
>> currently a no-go with 2.5GB of heap space, with only a main class that
>> builds the Lookup data.
>>
>> BR,
>> Elmer
>>
>>
>> -----Original Message-----
>> From: Michael McCandless
>> Sent: Wednesday, July 06, 2011 6:23 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Autocompletion on large index
>>
>> You could try storing your autocomplete index in a RAMDirectory?
>>
>> But: I'm surprised you see the FST suggest impl using up so much RAM;
>> very low memory usage is one of the strengths of the FST approach.
>> Can you share the text (titles) you are feeding to the suggest module?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Wed, Jul 6, 2011 at 12:08 PM, Elmer wrote:
>>>
>>> Hi again.
>>>
>>> I have created my own autocompleter based on the spellchecker. This
>>> works well in the sense that it is able to create an autocompletion
>>> index from my 'publication' index. However, integrated in my web
>>> application, each keypress asks the autocompleter to search the index,
>>> which is stored on disk (not in memory), just like the spellchecker
>>> does (except that the spellchecker is not invoked on every keypress).
>>> With Lucene 3.3.0, autocompletion modules are included, which load
>>> their trees/fsa/... in memory. I'd like to use these modules, but the
>>> problem is that they use more than 2.5GB, causing heap space exceptions.
>>> This happens when I try to build a Lookup index (fst, jaspell or tst,
>>> doesn't matter) from my 'publication' index consisting of 1.3M
>>> publications. The field I use for autocompletion holds the titles of the
>>> publications indexed untokenized (but lowercased).
>>>
>>> Code:
>>> Lookup autoCompleter = new TSTLookup();
>>> FSDirectory dir = FSDirectory.open(new File("PATHTOINDEX"));
>>> LuceneDictionary dict =
>>>     new LuceneDictionary(IndexReader.open(dir), "title_suggest");
>>> autoCompleter.build(dict);
>>>
>>> Is it possible to have the autocompletion module work in memory on
>>> such a dataset without increasing Java's heap space?
>>> FTR, the 3.3.0 autocompletion modules use more than 2.5GB of RAM, whereas
>>> my own autocompleter index is stored on disk using about 300MB.
>>>
>>> BR,
>>> Elmer
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
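
The suggestion at the top of this message (keep the lowercased titles in a
regular sorted structure and answer each keypress with a prefix lookup) can
be illustrated with a minimal sketch. This is not Lucene code and not the
approach from the talk mentioned above: a plain sorted String array stands
in for the on-disk term dictionary, and the class name PrefixLookupSketch
and the method complete are hypothetical names chosen for illustration. In
Lucene itself the analogous step would presumably be a PrefixQuery against
the "title_suggest" field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/**
 * Illustrative stand-in for prefix lookups over a sorted term list.
 * Assumption: a sorted in-memory array plays the role of the term
 * dictionary; a real setup would keep the terms in an on-disk index.
 */
public class PrefixLookupSketch {

    /** Lowercased, de-duplicated titles in sorted order, so terms sharing a prefix are adjacent. */
    private final String[] sortedTerms;

    public PrefixLookupSketch(List<String> titles) {
        sortedTerms = titles.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .sorted()
                .distinct()
                .toArray(String[]::new);
    }

    /** Returns up to maxResults terms that start with the given prefix, in sorted order. */
    public List<String> complete(String prefix, int maxResults) {
        String key = prefix.toLowerCase(Locale.ROOT);
        // Binary search finds the key itself or the position where it would be inserted;
        // every term that starts with the key sorts at or after that position.
        int pos = Arrays.binarySearch(sortedTerms, key);
        int start = pos >= 0 ? pos : -(pos + 1);
        List<String> matches = new ArrayList<>();
        for (int i = start; i < sortedTerms.length && matches.size() < maxResults; i++) {
            if (!sortedTerms[i].startsWith(key)) {
                break; // left the contiguous run of terms sharing the prefix
            }
            matches.add(sortedTerms[i]);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Example titles taken from the thread.
        PrefixLookupSketch lookup = new PrefixLookupSketch(List.of(
                "Retrieval of bibliographic records using Apache Lucene",
                "Digital Library Information Appliances",
                "Two-objective method for crisp and fuzzy interval comparison in optimization"));
        System.out.println(lookup.complete("dig", 5));
    }
}
```

Each lookup costs one binary search plus a short scan over the matching
run, so memory stays proportional to the raw terms rather than to a
per-character tree of nodes, which is the overhead discussed in the thread.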