Message-ID: <4AEA3828.6060309@gmail.com>
Date: Thu, 29 Oct 2009 20:49:44 -0400
From: Mark Miller
To: java-user@lucene.apache.org
Subject: Re: IO exception during merge/optimize

Thanks a lot Peter! Really appreciate it.

Peter Keegan wrote:
> Mark,
>
> With 1.9G, I had to increase the JVM heap significantly (to 8G) to avoid
> paging and GC hits. Here is a table comparing indexing times, optimizing
> times and peak memory usage as a function of the RAMBufferSize. This was
> run on a 64-bit server with 32GB RAM:
>
> RamSize   Index(min)   Optimize(min)   Max VM
> 1.9G      24           5               5G
> 800M      24           5               4G
>
> Not much difference. I'll make a couple more runs with lower values.
> Btw, the indexing times are really about 5 min. shorter because of some
> non-Lucene related delays after the last document.
>
> Peter
>
> On Thu, Oct 29, 2009 at 4:30 PM, Mark Miller wrote:
>
>> Any chance I could get you to try that again with a buffer of like 800MB
>> to a gig and do a comparison?
>>
>> I've been investigating the returns you get with a larger buffer size.
>> It appears to be pretty diminishing returns over 100MB or so - at higher
>> than that, I've gotten both slower speeds for some sizes, and larger
>> gains for others. But only better by 5-10 docs a second up to a gig. But
>> I can't reliably test at over a gig - I have only 4 GB of RAM, and even
>> with that, at over a gig it starts to page and the performance gets hit.
>> I'd love to see what kind of benefit you see going from around a gig to
>> just under 2.
>>
>> Peter Keegan wrote:
>>
>>> Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with
>>> optimization in just under 30 min.
>>> I used setRAMBufferSizeMB=1.9G
>>>
>>> Peter
>>>
>>> On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan wrote:
>>>
>>>> A handful of the source documents did contain the U+FFFF character. The
>>>> patch from LUCENE-2016
>>>> <https://issues.apache.org/jira/browse/LUCENE-2016> fixed the problem.
>>>> Thanks Mike!
>>>>
>>>> Peter
>>>>
>>>> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <lucene@mikemccandless.com> wrote:
>>>>
>>>>> Hmm, only a few affected terms, and all this particular
>>>>> "literals:cfid196$" term, with optional suffixes. Really strange.
>>>>>
>>>>> One thing that's odd is the exact term "literals:cfid196$" is printed
>>>>> twice, which should never happen (every unique term should be stored
>>>>> only once, in the terms dict).
>>>>>
>>>>> And, otherwise, CheckIndex got through the index just fine.
>>>>>
>>>>> Try searching a TermQuery with these affected terms and see if it
>>>>> succeeds? If so, maybe try making an index with one or two of
>>>>> them, alone, and see if that index shows the problem?
>>>>>
>>>>> OK I'm attaching more mods.
>>>>> Can you re-run your CheckIndex? It will
>>>>> produce an enormous amount of output, but if you can excise the few
>>>>> lines around when that warning comes out & post back that'd be great.
>>>>>
>>>>> Mike
>>>>>
>>>>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan wrote:
>>>>>
>>>>>> Just to be safe, I ran with the official jar file from one of the mirrors
>>>>>> and reproduced the problem.
>>>>>> The debug session is not showing any characters = '\uffff' (checking this in
>>>>>> Tokenizer).
>>>>>> The output from the modified CheckIndex follows. There are only a few terms
>>>>>> with the inconsistency. They are all legitimate terms from the app's
>>>>>> context. With this info, I might be able to isolate the source documents.
>>>>>> What should I be looking for when they are indexed?
>>>>>>
>>>>>> CheckIndex output:
>>>>>>
>>>>>> Opening index @ D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
>>>>>>
>>>>>> Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
>>>>>>   1 of 3: name=_0 docCount=413585
>>>>>>     compound=false
>>>>>>     hasProx=true
>>>>>>     numFiles=8
>>>>>>     size (MB)=1,148.817
>>>>>>     diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>>>>>>       817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>>>>>       java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>>>>>     docStoreOffset=0
>>>>>>     docStoreSegment=_0
>>>>>>     docStoreIsCompoundFile=false
>>>>>>     no deletions
>>>>>>     test: open reader.........OK
>>>>>>     test: fields..............OK [33 fields]
>>>>>>     test: field norms.........OK [33 fields]
>>>>>>     test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs pairs; 340244234 tokens]
>>>>>>     test: stored fields.......OK [1240755 total field
>>>>>>       count; avg 3 fields per doc]
>>>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>>>
>>>>>>   2 of 3: name=_1 docCount=359068
>>>>>>     compound=false
>>>>>>     hasProx=true
>>>>>>     numFiles=8
>>>>>>     size (MB)=1,125.161
>>>>>>     diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>>>>>>       817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>>>>>       java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>>>>>     docStoreOffset=413585
>>>>>>     docStoreSegment=_0
>>>>>>     docStoreIsCompoundFile=false
>>>>>>     no deletions
>>>>>>     test: open reader.........OK
>>>>>>     test: fields..............OK [33 fields]
>>>>>>     test: field norms.........OK [33 fields]
>>>>>>     test: terms, freq, prox...WARNING: term literals:cfid196$ docFreq=43 != num docs seen 4 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$commandant docFreq=1 != num docs seen 9 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$on docFreq=3178 != num docs seen 1 + num docs deleted 0
>>>>>> OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens]
>>>>>>     test: stored fields.......OK [1077204 total field count; avg 3 fields per doc]
>>>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>>>
>>>>>>   3 of 3: name=_2 docCount=304849
>>>>>>     compound=false
>>>>>>     hasProx=true
>>>>>>     numFiles=8
>>>>>>     size (MB)=962.004
>>>>>>     diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>>>>>>       817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>>>>>       java.version=1.6.0_16, java.vendor=Sun
>>>>>>       Microsystems Inc.}
>>>>>>     docStoreOffset=772653
>>>>>>     docStoreSegment=_0
>>>>>>     docStoreIsCompoundFile=false
>>>>>>     no deletions
>>>>>>     test: open reader.........OK
>>>>>>     test: fields..............OK [33 fields]
>>>>>>     test: field norms.........OK [33 fields]
>>>>>>     test: terms, freq, prox...WARNING: term contents:? docFreq=1 != num docs seen 246 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$ docFreq=45 != num docs seen 4 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$cashier docFreq=1 != num docs seen 37 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$interrogation docFreq=181 != num docs seen 1 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$leader docFreq=1 != num docs seen 353 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$microsoft docFreq=3114 != num docs seen 1 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$nt docFreq=200 != num docs seen 1 + num docs deleted 0
>>>>>> OK [6497769 terms; 145296880 terms/docs pairs; 293458734 tokens]
>>>>>>     test: stored fields.......OK [914547 total field count; avg 3 fields per doc]
>>>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>>>
>>>>>> No problems were detected with this index.
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> On Wed, Oct 28, 2009 at 11:29 AM, Michael McCandless <lucene@mikemccandless.com> wrote:
>>>>>>
>>>>>>> On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <peterlkeegan@gmail.com> wrote:
>>>>>>>
>>>>>>>> The only change I made to the source code was the patch for PayloadNearQuery
>>>>>>>> (LUCENE-1986).
>>>>>>>
>>>>>>> That patch certainly shouldn't lead to this.
>>>>>>>
>>>>>>>> It's possible that our content contains U+FFFF. I will run in debugger and
>>>>>>>> see.
>>>>>>>
>>>>>>> OK may as well check just so we cover all possibilities.
>>>>>>>
>>>>>>>> The data is 'sensitive', so I may not be able to provide a bad segment,
>>>>>>>> unfortunately.
>>>>>>>
>>>>>>> OK, maybe we can modify your CheckIndex instead. Let's start with
>>>>>>> this, which prints a warning whenever the docFreq differs but
>>>>>>> otherwise continues (vs throwing RuntimeException).
>>>>>>> I'm curious how
>>>>>>> many terms show this, and whether the TermEnum keeps working after
>>>>>>> this term that has different docFreq:
>>>>>>>
>>>>>>> Index: src/java/org/apache/lucene/index/CheckIndex.java
>>>>>>> ===================================================================
>>>>>>> --- src/java/org/apache/lucene/index/CheckIndex.java (revision 829889)
>>>>>>> +++ src/java/org/apache/lucene/index/CheckIndex.java (working copy)
>>>>>>> @@ -672,8 +672,8 @@
>>>>>>>            }
>>>>>>>
>>>>>>>            if (freq0 + delCount != docFreq) {
>>>>>>> -            throw new RuntimeException("term " + term + " docFreq=" +
>>>>>>> -                                       docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
>>>>>>> +            System.out.println("WARNING: term " + term + " docFreq=" +
>>>>>>> +                               docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
>>>>>>>            }
>>>>>>>          }
>>>>>>>
>>>>>>> Mike
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com

--
- Mark

http://www.lucidimagination.com
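
Following up on the U+FFFF finding in this thread: the sketch below is a hypothetical pre-indexing check, not the LUCENE-2016 patch and not part of Lucene. It assumes the document text is available as a Java String before it reaches the analyzer. The class name NoncharScan, its methods, and the choice of U+FFFD as the replacement are illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for locating U+FFFF in source text before indexing. */
final class NoncharScan {
    /** The code point reported in this thread as present in a few documents. */
    static final char BAD = '\uFFFF';

    /** Returns the char offsets at which U+FFFF occurs in the given text. */
    static List<Integer> offsetsOf(String text) {
        List<Integer> hits = new ArrayList<>();
        for (int i = text.indexOf(BAD); i >= 0; i = text.indexOf(BAD, i + 1)) {
            hits.add(i);
        }
        return hits;
    }

    /** Returns the text with every U+FFFF replaced by U+FFFD (an assumed policy). */
    static String sanitize(String text) {
        return text.replace(BAD, '\uFFFD');
    }

    public static void main(String[] args) {
        // Made-up sample text; the real documents in the thread were not shared.
        String doc = "cfid196$" + BAD + "leader";
        System.out.println("offsets: " + offsetsOf(doc));
        System.out.println("clean after sanitize: " + offsetsOf(sanitize(doc)).isEmpty());
    }
}
```

Running offsetsOf over each source document would be one way to isolate the handful of documents Peter mentions. Whether to reject or rewrite them is an application decision.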