From: Pablo Mendes <pablomendes@gmail.com>
To: java-user@lucene.apache.org
Date: Tue, 10 Aug 2010 15:51:40 +0200
Subject: Re: Scaling Lucene to 1bln docs

Shelly,
Do you mind sharing with the list the final settings you used for your
best results?

Cheers,
Pablo

On Tue, Aug 10, 2010 at 3:49 PM, anshum.gupta@naukri.com wrote:
> Hey Shelly,
> If you want to get more info on Lucene, I'd recommend you get a copy of
> Lucene in Action, 2nd Ed. It'll help you get the hang of a lot of things!
> :)
>
> --
> Anshum
> http://blog.anshumgupta.net
>
> Sent from BlackBerry®
>
> -----Original Message-----
> From: Shelly_Singh
> Date: Tue, 10 Aug 2010 19:11:11
> To: java-user@lucene.apache.org
> Reply-To: java-user@lucene.apache.org
> Subject: RE: Scaling Lucene to 1bln docs
>
> Hi folks,
>
> Thanks for the excellent support and guidance on my very first day on
> this mailing list...
> At the end of the day, I have very optimistic results: searching 100mln
> docs takes less than 1ms, and the index creation time is not huge either
> (close to 15 minutes).
>
> I am now hitting the 1bln mark with roughly the same settings. But I
> want to understand Norms and TermFilters.
>
> Can someone explain why (or why not) one should use each of these, and
> what tradeoffs each has?
>
> Regards,
> Shelly
>
> -----Original Message-----
> From: Danil ŢORIN [mailto:torindan@gmail.com]
> Sent: Tuesday, August 10, 2010 6:52 PM
> To: java-user@lucene.apache.org
> Subject: Re: Scaling Lucene to 1bln docs
>
> That won't work... if you have something like "A Basic Crazy Document
> E-something F-something G-something... you get the point", it will go to
> all shards, so the whole point of sharding will be compromised... you'll
> end up with a 26-billion-document index ;)
>
> Looks like the only way is to search all shards.
> Depending on available hardware (1 Azul... 50 EC2), expected traffic
> (1qps... 1000qps), expected query time (10 msec... 3 sec), redundancy
> (it's a large dataset, I don't think you want to lose it), and so on,
> you'll have to decide how many partitions you want.
>
> It may work with 8-10; it may need 50-64. (I usually use 2^n, as it's
> easier to split each shard in 2 when the index grows too much.)
>
> On such large datasets there is a lot of tuning and custom code, and no
> one-size-fits-all solution.
> Lucene is just a tool (a fine one), but you need to use it wisely to
> achieve great results.
>
> On Tue, Aug 10, 2010 at 15:55, Shelly_Singh <Shelly_Singh@infosys.com>
> wrote:
> > Hmm.. I get the point. But in my application, the document is
> > basically a descriptive name of a particular thing. The user will
> > search by name (or part of a name) and I need to pull out all info
> > pointed to by that name. This info is externalized in a db.
> >
> > One option I can think of is:
> > I can shard based on the first letter of a name. So, "Alan Mathur of
> > New Delhi" may go to shard "A". But since the name will have 'n'
> > tokens, and the user may type any one token, this will not work. I can
> > further tweak this such that I index the same document into multiple
> > indices (one for each token). So, the same document may be indexed
> > into shards "A", "M", "N" and "D".
> > I am not able to think of another option.
> >
> > Comments welcome.
> >
> > -----Original Message-----
> > From: Danil ŢORIN [mailto:torindan@gmail.com]
> > Sent: Tuesday, August 10, 2010 6:11 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Scaling Lucene to 1bln docs
> >
> > I'd second that.
> >
> > It doesn't have to be a date for sharding. Maybe every query has some
> > specific field, like UserId or something, so you can redirect to a
> > specific shard instead of hitting all 10 indices.
> >
> > You have to have some kind of narrowing: searching 1bn documents with
> > queries that may hit all documents is useless.
> > A user won't look at more than, let's say, 100 results (if presented
> > properly, maybe 1000).
> >
> > Those fields that narrow the result set are good candidates for
> > sharding keys.
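
For illustration, a minimal routing helper in the spirit of the advice
above. The class and method names are hypothetical, and it assumes the
power-of-two (2^n) shard count Danil suggests, so a full shard can later
be split in two:

    // Hypothetical helper: documents and queries that carry a stable key
    // (e.g. a UserId) are routed to a single shard instead of all of them.
    public class ShardRouter {
        private final int numShards;

        public ShardRouter(int numShards) {
            // Power-of-two shard counts keep the mask cheap and let a
            // full shard be split in two later, as suggested above.
            if (Integer.bitCount(numShards) != 1)
                throw new IllegalArgumentException(
                        "numShards must be a power of two");
            this.numShards = numShards;
        }

        /** Map a stable routing key (e.g. a userId) to a shard number. */
        public int shardFor(String routingKey) {
            int h = routingKey.hashCode();
            h ^= (h >>> 16);            // fold high bits into the low bits
            return h & (numShards - 1); // cheap modulo for power-of-two sizes
        }
    }
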
> >
> > On Tue, Aug 10, 2010 at 15:32, Dan OConnor wrote:
> >> Shelly:
> >>
> >> You wouldn't necessarily have to use a multisearcher. A suggested
> >> alternative is:
> >>
> >> - Shard into 10 indices. If you need the concept of a date-range
> >>   search, I would assign the documents to the shard by date;
> >>   otherwise random assignment is fine.
> >> - Have a pool of IndexSearchers for each index.
> >> - When a search comes in, allocate a searcher from each index to the
> >>   search.
> >> - Perform the search in parallel across all indices.
> >> - Merge the results in your own code using an efficient merging
> >>   algorithm.
> >>
> >> Regards,
> >> Dan
> >>
> >> -----Original Message-----
> >> From: Shelly_Singh [mailto:Shelly_Singh@infosys.com]
> >> Sent: Tuesday, August 10, 2010 8:20 AM
> >> To: java-user@lucene.apache.org
> >> Subject: RE: Scaling Lucene to 1bln docs
> >>
> >> No sort. I will need relevance based on TF. If I shard, I will have
> >> to search in all indices.
> >>
> >> -----Original Message-----
> >> From: anshum.gupta@naukri.com [mailto:anshumg@gmail.com]
> >> Sent: Tuesday, August 10, 2010 1:54 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Re: Scaling Lucene to 1bln docs
> >>
> >> Would like to know: are you using a particular type of sort? Do you
> >> need to sort on relevance? Can you shard and restrict your search to
> >> a limited set of indexes functionally?
> >>
> >> --
> >> Anshum
> >> http://blog.anshumgupta.net
> >>
> >> Sent from BlackBerry®
> >>
> >> -----Original Message-----
> >> From: Shelly_Singh
> >> Date: Tue, 10 Aug 2010 13:31:38
> >> To: java-user@lucene.apache.org
> >> Reply-To: java-user@lucene.apache.org
> >> Subject: RE: Scaling Lucene to 1bln docs
> >>
> >> Hi Anshum,
> >>
> >> I am already running with the 'setCompoundFile' option off.
> >> And thanks for pointing out mergeFactor. I had tried a higher
> >> mergeFactor a couple of days ago, but got an OOM, so I discarded it.
> >> Later I figured out that the OOM was because maxMergeDocs was
> >> unlimited and I was using MMap. You are right, I should try a higher
> >> mergeFactor.
> >>
> >> With regard to the multithreaded approach, I was considering creating
> >> 10 different threads, each indexing 100mln docs, coupled with a
> >> Multisearcher to which I will feed these 10 indices. Do you think
> >> this will improve performance?
> >>
> >> And just FYI, I have the latest reading for 1bln docs: indexing time
> >> is 2 hrs and search time is 15 secs. I can live with the indexing
> >> time, but the search time is highly unacceptable.
> >>
> >> Help again.
> >>
> >> -----Original Message-----
> >> From: Anshum [mailto:anshumg@gmail.com]
> >> Sent: Tuesday, August 10, 2010 12:55 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Re: Scaling Lucene to 1bln docs
> >>
> >> Hi Shelly,
> >> That seems like a reasonable data set size. I'd suggest you increase
> >> your mergeFactor, as a mergeFactor of 10 means you are only buffering
> >> 10 docs in memory before writing them to a file (and incurring I/O).
> >> You could actually flush by RAM usage instead of a doc count. Turn
> >> off the compound file structure for indexing, as it generally takes
> >> more time to create a cfs index.
> >>
> >> Plus, the time would not grow linearly: the larger the segments get,
> >> the more time it takes to add more docs and merge them together
> >> intermittently.
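
For illustration, a minimal sketch of the settings Anshum describes
above, against the 2010-era (Lucene 3.0.x) IndexWriter API. The index
path and the concrete numbers are placeholder assumptions, not tuned
recommendations:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class TunedWriter {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/path/to/index")),
                    new StandardAnalyzer(Version.LUCENE_30),
                    true,                                // create a fresh index
                    IndexWriter.MaxFieldLength.LIMITED); // cap tokens per field
            writer.setUseCompoundFile(false); // skip packing segments into .cfs
            writer.setRAMBufferSizeMB(256);   // flush by RAM used, not doc count
            writer.setMergeFactor(30);        // merge segments less often
            // ... addDocument() loop goes here ...
            writer.close();
        }
    }
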
> >> You may also use a multithreaded approach in case reading the source
> >> takes time in your case, though the IndexWriter would have to be
> >> shared among all the threads.
> >>
> >> --
> >> Anshum Gupta
> >> http://ai-cafe.blogspot.com
> >>
> >> On Tue, Aug 10, 2010 at 12:24 PM, Shelly_Singh
> >> <Shelly_Singh@infosys.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> I am developing an application which uses Lucene for indexing and
> >>> searching 1bln documents. (The document size is very small, though:
> >>> each document has a single field of 5-10 words, so I believe my data
> >>> size is within the tested limits.)
> >>>
> >>> I am using the following configuration:
> >>> 1. 1.5 gig RAM to the JVM
> >>> 2. 100GB disk space.
> >>> 3. Index creation tuning factors:
> >>>    a. mergeFactor = 10
> >>>    b. maxFieldLength = 10
> >>>    c. maxMergeDocs = 5000000 (if I try a larger value, I get an
> >>>       out-of-memory)
> >>>
> >>> With these settings, I am able to create an index of 100 million
> >>> docs (10 pow 8) in 15 mins, consuming 2.5GB of disk space. That is
> >>> quite satisfactory for me, but nevertheless I want to know what else
> >>> can be done to tune it further. Please help.
> >>> Also, with these settings, can I expect the time and size to grow
> >>> linearly for 1bln (10 pow 9) documents?
> >>>
> >>> Thanks and Regards,
> >>>
> >>> Shelly Singh
> >>> Center For Knowledge Driven Information Systems, Infosys
> >>> Email: shelly_singh@infosys.com
> >>> Phone: (M) 91 992 369 7200, (VoIP) 2022978622
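
For illustration, a rough sketch of the search-all-shards-and-merge
approach Dan OConnor outlines earlier in the thread, against the
2010-era searcher API. The shard paths, query, and pool size are
assumptions, and the naive score sort assumes scores are roughly
comparable across shards, which only loosely holds unless term
statistics are similar between them:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class ShardedSearch {
        public static void main(String[] args) throws Exception {
            final int numShards = 10;
            final int topN = 100;
            final Query query = new TermQuery(new Term("name", "alan"));

            // One searcher per shard; a real system would pool and reuse.
            List<IndexSearcher> shards = new ArrayList<IndexSearcher>();
            for (int i = 0; i < numShards; i++)
                shards.add(new IndexSearcher(IndexReader.open(
                        FSDirectory.open(new File("shard-" + i)))));

            // Fan the query out to every shard in parallel.
            ExecutorService pool = Executors.newFixedThreadPool(numShards);
            List<Future<TopDocs>> futures = new ArrayList<Future<TopDocs>>();
            for (final IndexSearcher s : shards)
                futures.add(pool.submit(new Callable<TopDocs>() {
                    public TopDocs call() throws Exception {
                        return s.search(query, topN);
                    }
                }));

            // Naive merge: concatenate per-shard hits, re-sort by score.
            // Note the doc ids printed are shard-local; a real merge would
            // also remember which shard each hit came from.
            List<ScoreDoc> merged = new ArrayList<ScoreDoc>();
            for (Future<TopDocs> f : futures)
                Collections.addAll(merged, f.get().scoreDocs);
            Collections.sort(merged, new Comparator<ScoreDoc>() {
                public int compare(ScoreDoc a, ScoreDoc b) {
                    return Float.compare(b.score, a.score);
                }
            });
            for (ScoreDoc sd : merged.subList(0, Math.min(topN, merged.size())))
                System.out.println("doc=" + sd.doc + " score=" + sd.score);
            pool.shutdown();
        }
    }
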
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org