Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of markrmiller@gmail.com
 designates 209.85.217.219 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=message-id:date:from:user-agent:mime-version:to:subject:references
         :in-reply-to:x-enigmail-version:content-type
         :content-transfer-encoding;
        b=Dm1CBQldu1gV+rz3DZZhgv1nLJ8WkjCdO2h/7EOyU5i1hqi0EdlAnAs2y28/8ry70l
         KhX4ZkRe5Mn9l2+e3WljbCukMDr6J5t0zLvblpHdE5bDi0bgqd+ToIl5JB4aFeiO0p5p
         DqnqDiTYpeMMn2Rh3efmJhIU9L5wLWuebNMOI=
Message-ID: <4AEA3828.6060309@gmail.com>
Date: Thu, 29 Oct 2009 20:49:44 -0400
From: Mark Miller <markrmiller@gmail.com>
User-Agent: Thunderbird 2.0.0.23 (X11/20090817)
MIME-Version: 1.0
To: java-user@lucene.apache.org
Subject: Re: IO exception during merge/optimize
References: <e994873a0910241420o39422fc7p4318d821ddef8a90@mail.gmail.com>
	 <e994873a0910280721r3641821fl4cf9fb90716d4a6a@mail.gmail.com>
	 <9ac0c6aa0910280743vc1cf36dkdb6a9558a022d6d5@mail.gmail.com>
	 <e994873a0910280758h67f734a7k397d7c418fc807fb@mail.gmail.com>
	 <9ac0c6aa0910280829r3f4f167ava6a41c983353c392@mail.gmail.com>
	 <e994873a0910280923l6b73ec26nd01f112f9d7c5b2c@mail.gmail.com>
	 <9ac0c6aa0910281029p6ac2729csa85ca832057e07e0@mail.gmail.com>
	 <e994873a0910291246v2b03a000s9248aad2123f40ce@mail.gmail.com>
	 <e994873a0910291255r5d3421f8g56d8a380ffb3841f@mail.gmail.com>
	 <4AE9FB61.6030009@gmail.com>
 <e994873a0910291606i3118dab2ge9d85027134b1f8a@mail.gmail.com>
In-Reply-To: <e994873a0910291606i3118dab2ge9d85027134b1f8a@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Thanks a lot Peter! Really appreciate it.

Peter Keegan wrote:
> Mark,
>
> With 1.9G, I had to increase the JVM heap significantly (to 8G)  to avoid
> paging and GC hits. Here is a table comparing indexing times, optimizing
> times and peak memory usage as a function of the  RAMBufferSize. This was
> run on a 64-bit server with 32GB RAM:
>
> RamSize        Index(min)    Optimize(min)     Max VM
> 1.9G         24            5                   5G
> 800M        24            5                  4G
>
> Not much difference. I'll make a couple more runs with lower values.
> Btw, the indexing times are really about 5 min. shorter because of some
> non-Lucene related delays after the last document.
>
> Peter
>
>
>
> On Thu, Oct 29, 2009 at 4:30 PM, Mark Miller <markrmiller@gmail.com> wrote:
>
>   
>> Any chance I could get you to try that again with a buffer of like 800MB
>> to a gig and do a comparison?
>>
>> I've been investigating the returns you get with a larger buffer size.
>> It appears to be pretty diminishing returns over 100MB or so - at higher
>> than that, I've gotten both slower speeds for some sizes, and larger
>> gains for others. But only better by 5-10 docs a second up to a gig. But
>> I can't reliably test at over a gig - I have only 4 GB of RAM, and even
>> with that, at over a gig it starts to page and the performance gets hit.
>> I'd love to see what kind of benefit you see going from around a gig to
>> just under 2.
>>
>> Peter Keegan wrote:
>>     
>>> Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with
>>> optimization in just under 30 min.
>>> I used setRAMBufferSizeMB=1.9G
>>>
>>> Peter
>>>
>>> On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan <peterlkeegan@gmail.com
>>> wrote:
>>>
>>>
>>>       
>>>> A handful of the source documents did contain the U+FFFF character. The
>>>> patch from *LUCENE-2016<
>>>>         
>> https://issues.apache.org/jira/browse/LUCENE-2016>
>>     
>>>> *fixed the problem.
>>>> Thanks Mike!
>>>>
>>>> Peter
>>>>
>>>>
>>>> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <
>>>> lucene@mikemccandless.com> wrote:
>>>>
>>>>
>>>>         
>>>>> Hmm, only a few affected terms, and all this particular
>>>>> "literals:cfid196$" term, with optional suffixes.  Really strange.
>>>>>
>>>>> One things that's odd is the exact term "literals:cfid196$" is printed
>>>>> twice, which should never happen (every unique term should be stored
>>>>> only once, in the terms dict).
>>>>>
>>>>> And, otherwise, CheckIndex got through the index just fine.
>>>>>
>>>>> Try searching a TermQuery with these affected terms and see if it
>>>>> succeeds?  If so, maybe trying making an index with one or two of
>>>>> them, alone, and see if that index shows the problem?
>>>>>
>>>>> OK I'm attaching more mods.  Can you re-run your CheckIndex?  It will
>>>>> produce an enormous amount of output, but if you can excise the few
>>>>> lines around when that warning comes out & post back that'd be great.
>>>>>
>>>>> Mike
>>>>>
>>>>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan <peterlkeegan@gmail.com
>>>>>           
>>>>> wrote:
>>>>>
>>>>>           
>>>>>> Just to be safe, I ran with the official jar file from one of the
>>>>>>
>>>>>>             
>>>>> mirrors
>>>>>
>>>>>           
>>>>>> and reproduced the problem.
>>>>>> The debug session is not showing any characters = '\uffff' (checking
>>>>>>
>>>>>>             
>>>>> this in
>>>>>
>>>>>           
>>>>>> Tokenizer).
>>>>>> The output from the modified CheckIndex follows. There are only a few
>>>>>>
>>>>>>             
>>>>> terms
>>>>>
>>>>>           
>>>>>> with the inconsistency. They are all legitimate terms from the app's
>>>>>> context. With this info, I might be able to isolate the source
>>>>>>
>>>>>>             
>>>>> documents.
>>>>>
>>>>>           
>>>>>> What should I be looking for when they are indexed?
>>>>>>
>>>>>> CheckInput output:
>>>>>>
>>>>>> Opening index @
>>>>>>
>>>>>>             
>>>>> D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
>>>>>
>>>>>           
>>>>>> Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS
>>>>>>
>>>>>>             
>>>>> [Lucene
>>>>>
>>>>>           
>>>>>> 2.9]
>>>>>>  1 of 3: name=_0 docCount=413585
>>>>>>    compound=false
>>>>>>    hasProx=true
>>>>>>    numFiles=8
>>>>>>    size (MB)=1,148.817
>>>>>>    diagnostics = {os.version=5.2, os=Windows 2003,
>>>>>>             
>> lucene.version=2.9.0
>>     
>>>>>> 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>>>>> java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>>>>>    docStoreOffset=0
>>>>>>    docStoreSegment=_0
>>>>>>    docStoreIsCompoundFile=false
>>>>>>    no deletions
>>>>>>    test: open reader.........OK
>>>>>>    test: fields..............OK [33 fields]
>>>>>>    test: field norms.........OK [33 fields]
>>>>>>    test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs
>>>>>>
>>>>>>             
>>>>> pairs;
>>>>>
>>>>>           
>>>>>> 340244234 tokens]
>>>>>>    test: stored fields.......OK [1240755 total field count; avg 3
>>>>>>             
>> fields
>>     
>>>>>> per doc]
>>>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>>>> vector fields per doc]
>>>>>>
>>>>>>  2 of 3: name=_1 docCount=359068
>>>>>>    compound=false
>>>>>>    hasProx=true
>>>>>>    numFiles=8
>>>>>>    size (MB)=1,125.161
>>>>>>    diagnostics = {os.version=5.2, os=Windows 2003,
>>>>>>             
>> lucene.version=2.9.0
>>     
>>>>>> 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>>>>> java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>>>>>    docStoreOffset=413585
>>>>>>    docStoreSegment=_0
>>>>>>    docStoreIsCompoundFile=false
>>>>>>    no deletions
>>>>>>    test: open reader.........OK
>>>>>>    test: fields..............OK [33 fields]
>>>>>>    test: field norms.........OK [33 fields]
>>>>>>    test: terms, freq, prox...WARNING: term  literals:cfid196$
>>>>>>             
>> docFreq=43
>>     
>>>>> !=
>>>>>
>>>>>           
>>>>>> num docs seen 4 + num docs deleted 0
>>>>>> WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num
>>>>>>             
>> docs
>>     
>>>>>> deleted 0
>>>>>> WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num
>>>>>>             
>> docs
>>     
>>>>>> deleted 0
>>>>>> WARNING: term  literals:cfid196$commandant docFreq=1 != num docs seen
>>>>>>             
>> 9
>>     
>>>>> +
>>>>>
>>>>>           
>>>>>> num docs deleted 0
>>>>>> WARNING: term  literals:cfid196$on docFreq=3178 != num docs seen 1 +
>>>>>>             
>> num
>>     
>>>>>> docs deleted 0
>>>>>> OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens]
>>>>>>    test: stored fields.......OK [1077204 total field count; avg 3
>>>>>>             
>> fields
>>     
>>>>>> per doc]
>>>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>>>> vector fields per doc]
>>>>>>
>>>>>>  3 of 3: name=_2 docCount=304849
>>>>>>    compound=false
>>>>>>    hasProx=true
>>>>>>    numFiles=8
>>>>>>    size (MB)=962.004
>>>>>>    diagnostics = {os.version=5.2, os=Windows 2003,
>>>>>>             
>> lucene.version=2.9.0
>>     
>>>>>> 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>>>>> java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>>>>>    docStoreOffset=772653
>>>>>>    docStoreSegment=_0
>>>>>>    docStoreIsCompoundFile=false
>>>>>>    no deletions
>>>>>>    test: open reader.........OK
>>>>>>    test: fields..............OK [33 fields]
>>>>>>    test: field norms.........OK [33 fields]
>>>>>>    test: terms, freq, prox...WARNING: term  contents:? docFreq=1 !=
>>>>>>             
>> num
>>     
>>>>>> docs seen 246 + num docs deleted 0
>>>>>> WARNING: term  literals:cfid196$ docFreq=45 != num docs seen 4 + num
>>>>>>
>>>>>>             
>>>>> docs
>>>>>
>>>>>           
>>>>>> deleted 0
>>>>>> WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num
>>>>>>             
>> docs
>>     
>>>>>> deleted 0
>>>>>> WARNING: term  literals:cfid196$cashier docFreq=1 != num docs seen 37
>>>>>>             
>> +
>>     
>>>>> num
>>>>>
>>>>>           
>>>>>> docs deleted 0
>>>>>> WARNING: term  literals:cfid196$interrogation docFreq=181 != num docs
>>>>>>
>>>>>>             
>>>>> seen 1
>>>>>
>>>>>           
>>>>>> + num docs deleted 0
>>>>>> WARNING: term  literals:cfid196$leader docFreq=1 != num docs seen 353
>>>>>>             
>> +
>>     
>>>>> num
>>>>>
>>>>>           
>>>>>> docs deleted 0
>>>>>> WARNING: term  literals:cfid196$microsoft docFreq=3114 != num docs
>>>>>>             
>> seen
>>     
>>>>> 1 +
>>>>>
>>>>>           
>>>>>> num docs deleted 0
>>>>>> WARNING: term  literals:cfid196$nt docFreq=200 != num docs seen 1 +
>>>>>>             
>> num
>>     
>>>>> docs
>>>>>
>>>>>           
>>>>>> deleted 0
>>>>>> OK [6497769 terms; 145296880 terms/docs pairs; 293458734 tokens]
>>>>>>    test: stored fields.......OK [914547 total field count; avg 3
>>>>>>             
>> fields
>>     
>>>>> per
>>>>>
>>>>>           
>>>>>> doc]
>>>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>>>> vector fields per doc]
>>>>>>
>>>>>> No problems were detected with this index.
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 28, 2009 at 11:29 AM, Michael McCandless <
>>>>>> lucene@mikemccandless.com> wrote:
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <
>>>>>>>               
>> peterlkeegan@gmail.com
>>     
>>>>>>> wrote:
>>>>>>>
>>>>>>>               
>>>>>>>> The only change I made to the source code was the patch for
>>>>>>>>
>>>>>>>>                 
>>>>>>> PayloadNearQuery
>>>>>>>
>>>>>>>               
>>>>>>>> (LUCENE-1986).
>>>>>>>>
>>>>>>>>                 
>>>>>>> That patch certainly shouldn't lead to this.
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>>> It's possible that our content contains U+FFFF. I will run in
>>>>>>>>
>>>>>>>>                 
>>>>> debugger
>>>>>
>>>>>           
>>>>>>> and
>>>>>>>
>>>>>>>               
>>>>>>>> see.
>>>>>>>>
>>>>>>>>                 
>>>>>>> OK may as well check just so we cover all possibilities.
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>>> The data is 'sensitive', so I may not be able to provide a bad
>>>>>>>>
>>>>>>>>                 
>>>>> segment,
>>>>>
>>>>>           
>>>>>>>> unfortunately.
>>>>>>>>
>>>>>>>>                 
>>>>>>> OK, maybe we can modify your CheckIndex instead.  Let's start with
>>>>>>> this, which prints a warning whenever the docFreq differs but
>>>>>>> otherwise continues (vs throwing RuntimeException).  I'm curious how
>>>>>>> many terms show this, and whether the TermEnum keeps working after
>>>>>>> this term that has different docFreq:
>>>>>>>
>>>>>>> Index: src/java/org/apache/lucene/index/CheckIndex.java
>>>>>>> ===================================================================
>>>>>>> --- src/java/org/apache/lucene/index/CheckIndex.java    (revision
>>>>>>>
>>>>>>>               
>>>>> 829889)
>>>>>
>>>>>           
>>>>>>> +++ src/java/org/apache/lucene/index/CheckIndex.java    (working
>>>>>>>               
>> copy)
>>     
>>>>>>> @@ -672,8 +672,8 @@
>>>>>>>         }
>>>>>>>
>>>>>>>         if (freq0 + delCount != docFreq) {
>>>>>>> -          throw new RuntimeException("term " + term + " docFreq=" +
>>>>>>> -                                     docFreq + " != num docs seen "
>>>>>>>               
>> +
>>     
>>>>>>> freq0 + " + num docs deleted " + delCount);
>>>>>>> +          System.out.println("WARNING: term  " + term + " docFreq="
>>>>>>>               
>> +
>>     
>>>>>>> +                             docFreq + " != num docs seen " + freq0
>>>>>>>               
>> +
>>     
>>>>>>> " + num docs deleted " + delCount);
>>>>>>>         }
>>>>>>>       }
>>>>>>>
>>>>>>> Mike
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>       
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>     
>
>   


-- 
- Mark

http://www.lucidimagination.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org