From: Michael McCandless
To: java-user@lucene.apache.org
Date: Tue, 11 Aug 2009 18:55:58 -0400
Subject: Re: ThreadedIndexWriter vs. IndexWriter
Message-ID: <9ac0c6aa0908111555m3fdabe6dyd4c13c10c34d95ed@mail.gmail.com>
In-Reply-To: <88546292-CE1D-474E-9E6A-5506DEFE57C6@mac.com>

Phew! Thank you for raising this... it was a sneaky one.

Mike

On Tue, Aug 11, 2009 at 4:13 PM, Jibo John wrote:
> Mike,
>
> Yes, it works perfect !
>
> I did observe a dip in the indexing throughput (1855 recs/sec vs. 2200
> recs/sec previously), but, more importantly, no data is lost this time.
>
> Thanks for helping me nail this down.
>
> -Jibo
>
> On Aug 11, 2009, at 11:12 AM, Michael McCandless wrote:
>
>> OK I found the problem!
>>
>> It was losing docs from the queue, when shutting down the thread pool,
>> because we were calling super's addDocument(doc) not addDocument(doc,
>> analyzer). IndexWriter was simply forwarding that call to
>> ThreadedIndexWriter's addDocument(doc, analyzer) which in turn would
>> do nothing because the thread pool was already told to shut down.
>> Larger queues made it much more likely to happen.
>>
>> Can you try the new version (attached)?
>>
>> Also, make sure you add 'doc.reuse.fields=false' to your alg (on
>> trunk).
>>
>> Mike
>>
>> On Tue, Aug 11, 2009 at 12:39 PM, Jibo John wrote:
>>>
>>> Mike,
>>>
>>> I wasn't exactly using the lucene core jar from MEAP.
>>>
>>> I have been building lucene from the source, and running the tests under
>>> lucene/java/trunk/contrib/benchmark/ (checked out 2 weeks ago, I guess)
>>> and, also under lucene/java/tags/lucene_2_4_1/contrib/benchmark/.
>>> In both cases, copied CreateThreadedIndexTask to
>>> org.apache.lucene.benchmark.byTask.tasks and ThreadedIndexWriter to
>>> org.apache.lucene.index.
>>>
>>> I have observed the issue in both the versions of lucene.
>>>
>>> Indexes were optimized separately using Lucli.
>>>
>>> PFA the classes and the alg.
>>>
>>> Thank you for your help with this one.
>>>
>>> -Jibo
>>>
>>> On Aug 11, 2009, at 3:13 AM, Michael McCandless wrote:
>>>
>>>> I'm baffled why you're losing docs w/ ThreadedIndexWriter.
>>>>
>>>> One question: your Lucene core JAR seems to be newer than the last
>>>> MEAP update. Did you update it manually?
>>>>
>>>> Also, your indexes were optimized, but your algs don't have an
>>>> optimize step -- did you separately run an optimize?
>>>>
>>>> Could you zip up the whole shebang (ThreadedIndexWriter.java,
>>>> CreateThreadedIndexTask.java, the algs) & post? Please CC me directly
>>>> so I can grab the zip file... thanks.
>>>>
>>>> Mike
>>>>
>>>> On Mon, Aug 3, 2009 at 12:37 PM, Jibo John wrote:
>>>>>
>>>>> Mike,
>>>>>
>>>>> Verified that I have the latest source code.
>>>>> Here are the alg files and the checkindexer output.
>>>>>
>>>>> -------------------- indexwriter alg --------------------
>>>>>
>>>>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>>>>> directory=FSDirectory
>>>>>
>>>>> doc.stored = true                                    #A
>>>>> docs.file=wikipedia.lines.txt
>>>>> ram.flush.mb=50
>>>>> compound=false
>>>>> merge.factor=5
>>>>> doc.add.log.step=1000
>>>>> doc.term.vector=false
>>>>> doc.term.vector.positions=false
>>>>> doc.term.vector.offsets=false
>>>>>
>>>>> { "Rounds"                                           #B
>>>>>  ResetSystemErase
>>>>>  { "BuildIndex"
>>>>>  -CreateIndex()
>>>>>  [ { "AddDocs" AddDoc > : 40000 ] : 5
>>>>>  #C
>>>>>  -CloseIndex()
>>>>>  }
>>>>>  NewRound
>>>>> } : 1
>>>>>
>>>>> RepSumByPrefRound BuildIndex                         #D
>>>>>
>>>>> -------------------- threadedindexwriter alg --------------------
>>>>>
>>>>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>>>>> directory=FSDirectory
>>>>>
>>>>> doc.stored = true                                    #A
>>>>> docs.file=wikipedia.lines.txt
>>>>> ram.flush.mb=50
>>>>> compound=false
>>>>> merge.factor=5
>>>>> doc.add.log.step=1000
>>>>> doc.term.vector=false
>>>>> doc.term.vector.positions=false
>>>>> doc.term.vector.offsets=false
>>>>> writer.num.threads=15
>>>>> writer.max.thread.queue.size=75
>>>>> work.dir=work_t
>>>>>
>>>>> { "Rounds"                                           #B
>>>>>  ResetSystemErase
>>>>>  { "BuildIndex"
>>>>>  -CreateThreadedIndex()
>>>>>  { "AddDocs" AddDoc > : 200000
>>>>>  -CloseIndex()
>>>>>  }
>>>>>  NewRound
>>>>> } : 1
>>>>>
>>>>> RepSumByPrefRound BuildIndex                         #D
>>>>>
>>>>> -------------------- threadedindexwriter checkindex --------------------
>>>>>
>>>>> $ java -classpath /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index
>>>>>
>>>>> NOTE: testing will be more thorough if you run java with
>>>>> '-ea:org.apache.lucene...', so assertions are enabled
>>>>>
>>>>> Opening index @ /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index
>>>>>
>>>>> Segments file=segments_3 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
>>>>>  1 of 1: name=_p docCount=199941
>>>>>  compound=true
>>>>>  hasProx=true
>>>>>  numFiles=3
>>>>>  size (MB)=317.1
>>>>>  diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M - 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true, mergeDocStores=false, java.vendor=Apple Inc., os.version=10.5.7, source=merge, mergeFactor=5}
>>>>>  docStoreOffset=0
>>>>>  docStoreSegment=_0
>>>>>  docStoreIsCompoundFile=false
>>>>>  no deletions
>>>>>  test: open reader.........OK
>>>>>  test: fields, norms.......OK [4 fields]
>>>>>  test: terms, freq, prox...OK [1269552 terms; 67887116 terms/docs pairs; 133241176 tokens]
>>>>>  test: stored fields.......OK [199941 total field count; avg 1 fields per doc]
>>>>>  test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>>
>>>>> No problems were detected with this index.
>>>>>
>>>>> -------------------- indexwriter checkindex --------------------
>>>>>
>>>>> $ java -classpath /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
>>>>>
>>>>> NOTE: testing will be more thorough if you run java with
>>>>> '-ea:org.apache.lucene...', so assertions are enabled
>>>>>
>>>>> Opening index @ /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
>>>>>
>>>>> Segments file=segments_a numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
>>>>>  1 of 1: name=_18 docCount=200000
>>>>>  compound=true
>>>>>  hasProx=true
>>>>>  numFiles=1
>>>>>  size (MB)=427.445
>>>>>  diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M - 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true, mergeDocStores=true, java.vendor=Apple Inc., os.version=10.5.7, source=merge, mergeFactor=4}
>>>>>  no deletions
>>>>>  test: open reader.........OK
>>>>>  test: fields, norms.......OK [4 fields]
>>>>>  test: terms, freq, prox...OK [3512343 terms; 80020204 terms/docs pairs; 163219760 tokens]
>>>>>  test: stored fields.......OK [200000 total field count; avg 1 fields per doc]
>>>>>  test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>>
>>>>> No problems were detected with this index.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>>
>>>>> Thanks,
>>>>> -Jibo
>>>>>
>>>>> On Aug 1, 2009, at 2:08 AM, Michael McCandless wrote:
>>>>>
>>>>>> (Please note that ThreadedIndexWriter is source code available with
>>>>>> the upcoming revision to Lucene in Action.)
>>>>>>
>>>>>> Phil, is it possible you are using an older version of the book's
>>>>>> source code? In particular, can you check whether your version of
>>>>>> ThreadedIndexWriter.java has this:
>>>>>>
>>>>>>  public void close(boolean doWait) throws CorruptIndexException, IOException {
>>>>>>    finish();
>>>>>>    super.close(doWait);
>>>>>>  }
>>>>>>
>>>>>> (I vaguely remember that being missing from earlier releases, which
>>>>>> could explain what you're seeing). If you are missing that, can you
>>>>>> download the current code from http://www.manning.com/hatcher3 and try
>>>>>> again?
>>>>>>
>>>>>> If that's not the problem... can you post the benchmark alg you are
>>>>>> using in each case?
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> On Fri, Jul 31, 2009 at 8:26 PM, Jibo John wrote:
>>>>>>>
>>>>>>> Hi Phil,
>>>>>>>
>>>>>>> It's 5 threads for IndexWriter.
>>>>>>>
>>>>>>> For ThreadedIndexWriter, I used:
>>>>>>>
>>>>>>> writer.num.threads=16
>>>>>>> writer.max.thread.queue.size=80
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Jibo
>>>>>>>
>>>>>>> On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote:
>>>>>>>
>>>>>>>> Hi Jibo,
>>>>>>>>
>>>>>>>> Your mergeFactor is different, and the resulting numFiles (segment
>>>>>>>> files) is different. Maybe each thread is responsible for a segment
>>>>>>>> file. Just curious - do you have 3 threads?
>>>>>>>>
>>>>>>>> Phil
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
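
The failure mode Mike describes in this thread (a queued job calls super's addDocument(doc), the base class forwards that call back into the overridden addDocument(doc, analyzer), and the resubmitted job is dropped because the pool is already shutting down) can be sketched without Lucene. This is only an illustration under stated assumptions, not the book's ThreadedIndexWriter: the names ForwardingBugSketch, BaseWriter and ThreadedWriter are hypothetical, plain strings stand in for Document and Analyzer, the use of CallerRunsPolicy as the rejection handler is an assumption, and the latch is a demo-only gate that holds workers until shutdown() has been called so the race reproduces every time.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ForwardingBugSketch {

    /** Hypothetical stand-in for IndexWriter: the 1-arg overload forwards to the 2-arg one. */
    static class BaseWriter {
        final AtomicInteger indexed = new AtomicInteger();

        public void addDocument(String doc) {
            addDocument(doc, "defaultAnalyzer"); // virtual call: a subclass override intercepts it
        }

        public void addDocument(String doc, String analyzer) {
            indexed.incrementAndGet(); // stands in for the real indexing work
        }
    }

    /** Hypothetical stand-in for ThreadedIndexWriter: both overloads enqueue a job. */
    static class ThreadedWriter extends BaseWriter {
        private final boolean buggy;
        private final ThreadPoolExecutor pool;
        // Demo-only gate (assumption): workers wait until shutdown() has been called.
        private final CountDownLatch shutdownCalled = new CountDownLatch(1);

        ThreadedWriter(int threads, int queueSize, boolean buggy) {
            this.buggy = buggy;
            // CallerRunsPolicy discards rejected tasks once the pool has been shut down.
            this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.SECONDS,
                    new ArrayBlockingQueue<Runnable>(queueSize),
                    new ThreadPoolExecutor.CallerRunsPolicy());
        }

        @Override
        public void addDocument(String doc) {
            submit(doc, null);
        }

        @Override
        public void addDocument(String doc, String analyzer) {
            submit(doc, analyzer);
        }

        private void submit(String doc, String analyzer) {
            pool.execute(() -> {
                awaitGate();
                if (analyzer != null) {
                    super.addDocument(doc, analyzer);          // base class does the work
                } else if (buggy) {
                    super.addDocument(doc);                    // forwarded back into submit()
                } else {
                    super.addDocument(doc, "defaultAnalyzer"); // fix: call the 2-arg overload
                }
            });
        }

        private void awaitGate() {
            try {
                shutdownCalled.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        void close() {
            pool.shutdown();             // queued jobs still run; new submissions are rejected
            shutdownCalled.countDown();  // release the gated workers
            try {
                pool.awaitTermination(1, TimeUnit.MINUTES);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Adds numDocs documents, closes the writer, and returns how many were indexed. */
    static int run(boolean buggy, int numDocs) {
        ThreadedWriter writer = new ThreadedWriter(4, numDocs, buggy);
        for (int i = 0; i < numDocs; i++) {
            writer.addDocument("doc" + i);
        }
        writer.close();
        return writer.indexed.get();
    }

    public static void main(String[] args) {
        System.out.println("buggy: " + run(true, 50) + " of 50 indexed");
        System.out.println("fixed: " + run(false, 50) + " of 50 indexed");
    }
}
```

Because the gate makes every job run after shutdown, the buggy variant here loses all 50 documents while the fixed variant keeps all 50. In the thread itself the loss was far smaller: only jobs still queued at shutdown were affected, which is consistent with the CheckIndex outputs above (199941 vs. 200000 documents, i.e. 59 lost).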