lucene-java-user mailing list archives

From Mark Kristensson <mark.kristens...@smartsheet.com>
Subject Re: IndexWriter.close() performance issue
Date Fri, 19 Nov 2010 00:09:29 GMT
I finally bucked up and made the change to CheckIndex to verify that I do not, in fact, have
any fields with norms in this index. The result is below; the largest segment currently is
#3, which has 300,000+ fields but no norms.
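
For reference, the check can be run programmatically along these lines (a rough sketch, not our exact harness; the index path is a placeholder):

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.FSDirectory;

    public class RunCheckIndex {
      public static void main(String[] args) throws Exception {
        // "/path/to/index" is a placeholder for the real index directory
        FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
        CheckIndex checker = new CheckIndex(dir);
        checker.setInfoStream(System.out);               // prints the per-segment report
        CheckIndex.Status status = checker.checkIndex(); // read-only check, no repairs
        System.out.println("clean? " + status.clean);
        dir.close();
      }
    }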

-Mark



Segments file=segments_acew numSegments=9 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 9: name=_bfkv docCount=8642
    compound=true
    hasProx=true
    numFiles=2
    size (MB)=7.921
    diagnostics = {optimize=false, mergeFactor=1, os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true, lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_bfkv_1.del]
    test: open reader.........OK [77 deleted docs]
    test: fields..............OK [114226 fields]
    test: field norms.........OK [0 fields]
    test: terms, freq, prox...OK [103996 terms; 926779 terms/docs pairs; 919464 tokens]
    test: stored fields.......OK [202850 total field count; avg 23.684 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  2 of 9: name=_1gi5 docCount=0
    compound=true
    hasProx=true
    numFiles=1
    size (MB)=0.001
    diagnostics = {optimize=false, mergeFactor=1, os.version=2.6.18-128.7.1.el5, os=Linux, mergeDocStores=true, lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    no deletions
    test: open reader.........OK
    test: fields..............OK [28 fields]
    test: field norms.........OK [0 fields]
    test: terms, freq, prox...OK [0 terms; 0 terms/docs pairs; 0 tokens]
    test: stored fields.......OK [0 total field count; avg � fields per doc]
    test: term vectors........OK [0 total vector count; avg � term/freq vector fields per doc]

  3 of 9: name=_bfkw docCount=6433351
    compound=true
    hasProx=true
    numFiles=2
    size (MB)=3,969.392
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true, lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_bfkw_s7.del]
    test: open reader.........OK [89111 deleted docs]
    test: fields..............OK [308832 fields]
    test: field norms.........OK [0 fields]
    test: terms, freq, prox...OK [47362222 terms; 733184933 terms/docs pairs; 720556927 tokens]
    test: stored fields.......OK [186735038 total field count; avg 29.434 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  4 of 9: name=_bglk docCount=100296
    compound=true
    hasProx=true
    numFiles=2
    size (MB)=83.448
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true, lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_bglk_1p.del]
    test: open reader.........OK [7027 deleted docs]
    test: fields..............OK [19192 fields]
    test: field norms.........OK [0 fields]
    test: terms, freq, prox...OK [1342162 terms; 13987377 terms/docs pairs; 13126384 tokens]
    test: stored fields.......OK [3713794 total field count; avg 39.818 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  5 of 9: name=_bglt docCount=3123
    compound=true
    hasProx=true
    numFiles=2
    size (MB)=1.999
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true, lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_bglt_q.del]
    test: open reader.........OK [878 deleted docs]
    test: fields..............OK [911 fields]
    test: field norms.........OK [0 fields]
    test: terms, freq, prox...OK [28803 terms; 345429 terms/docs pairs; 218626 tokens]
    test: stored fields.......OK [73229 total field count; avg 32.619 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  6 of 9: name=_bgme docCount=2339
    compound=true
    hasProx=true
    numFiles=2
    size (MB)=1.704
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true, lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_bgme_h.del]
    test: open reader.........OK [329 deleted docs]
    test: fields..............OK [1122 fields]
    test: field norms.........OK [0 fields]
    test: terms, freq, prox...OK [30846 terms; 316451 terms/docs pairs; 272709 tokens]
    test: stored fields.......OK [69847 total field count; avg 34.75 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  7 of 9: name=_bgnj docCount=2941
    compound=true
    hasProx=true
    numFiles=2
    size (MB)=2.2
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true, lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_bgnj_d.del]
    test: open reader.........OK [527 deleted docs]
    test: fields..............OK [1846 fields]
    test: field norms.........OK [0 fields]
    test: terms, freq, prox...OK [42379 terms; 412630 terms/docs pairs; 341300 tokens]
    test: stored fields.......OK [83805 total field count; avg 34.716 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  8 of 9: name=_bgo4 docCount=3899
    compound=true
    hasProx=true
    numFiles=1
    size (MB)=2.988
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true, lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    no deletions
    test: open reader.........OK
    test: fields..............OK [1367 fields]
    test: field norms.........OK [0 fields]
    test: terms, freq, prox...OK [52773 terms; 505461 terms/docs pairs; 505461 tokens]
    test: stored fields.......OK [160630 total field count; avg 41.198 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  9 of 9: name=_bgo5 docCount=4
    compound=true
    hasProx=true
    numFiles=1
    size (MB)=0.007
    diagnostics = {os.version=2.6.18-194.26.1.el5, os=Linux, lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=flush, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    no deletions
    test: open reader.........OK
    test: fields..............OK [87 fields]
    test: field norms.........OK [0 fields]
    test: terms, freq, prox...OK [298 terms; 440 terms/docs pairs; 440 tokens]
    test: stored fields.......OK [95 total field count; avg 23.75 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]



On Nov 17, 2010, at 1:51 PM, Michael McCandless wrote:

> Lucene interns field names... since you have a truly enormous number
> of unique fields, it's expected that intern will be called a lot.
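> 
> (Purely as a toy illustration of the cost, not Lucene code; the field-name pattern below is made up:)
> 
>     // interning one name is cheap, but hundreds of thousands of unique names add up
>     String[] names = new String[300000];
>     for (int i = 0; i < names.length; i++) {
>       names[i] = "userField_" + i;   // hypothetical unique field names
>     }
>     long t0 = System.currentTimeMillis();
>     for (String name : names) {
>       name.intern();
>     }
>     System.out.println("interning " + names.length + " unique names took "
>         + (System.currentTimeMillis() - t0) + " ms");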
> 
> But that said it's odd that it's this costly.
> 
> Can you post the stack traces that call intern?
> 
> Mike
> 
> On Fri, Nov 5, 2010 at 1:53 PM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>> Hmm...
>> 
>> So, I was going on this output from your CheckIndex:
>> 
>>   test: field norms.........OK [296713 fields]
>> 
>> But in fact I just looked and that number is bogus -- it's always
>> equal to the total number of fields, not the number of fields with norms
>> enabled.  I'll open an issue to fix this, but in the meantime can you
>> apply this patch to your CheckIndex and run it again?
>> 
>> Index: src/java/org/apache/lucene/index/CheckIndex.java
>> ===================================================================
>> --- src/java/org/apache/lucene/index/CheckIndex.java    (revision 1031678)
>> +++ src/java/org/apache/lucene/index/CheckIndex.java    (working copy)
>> @@ -570,8 +570,10 @@
>>       }
>>       final byte[] b = new byte[reader.maxDoc()];
>>       for (final String fieldName : fieldNames) {
>> -        reader.norms(fieldName, b, 0);
>> -        ++status.totFields;
>> +        if (reader.hasNorms(fieldName)) {
>> +          reader.norms(fieldName, b, 0);
>> +          ++status.totFields;
>> +        }
>>       }
>> 
>>       msg("OK [" + status.totFields + " fields]");
>> 
>> So if in fact you have already disabled norms then something else is
>> the source of the sudden slowness.  Though, such a huge number of
>> unique field names is not an area of Lucene that's very well tested...
>> perhaps there's something silly somewhere.  Maybe you can try
>> profiling just the init of your IndexReader?  (Eg, run java with
>> -agentlib:hprof=cpu=samples,depth=16,interval=1).
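>> 
>> (A minimal sketch of timing just the reader init, assuming a plain read-only open; the path is a placeholder:)
>> 
>>     import java.io.File;
>>     import org.apache.lucene.index.IndexReader;
>>     import org.apache.lucene.store.FSDirectory;
>> 
>>     public class TimeReaderOpen {
>>       public static void main(String[] args) throws Exception {
>>         long t0 = System.currentTimeMillis();
>>         // true = open read-only; "/path/to/index" is a placeholder
>>         IndexReader reader = IndexReader.open(
>>             FSDirectory.open(new File("/path/to/index")), true);
>>         System.out.println("IndexReader.open took "
>>             + (System.currentTimeMillis() - t0) + " ms");
>>         reader.close();
>>       }
>>     }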
>> 
>> Yes, both Index.NOT_ANALYZED_NO_NORMS and Index.NO will disable norms
>> as long as no document in the index ever had norms on (yes it does
>> "infect" heh).
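>> 
>> (For example, both of these ways of adding a field keep norms off; the names and values are made up, and `writer` is an existing IndexWriter:)
>> 
>>     Document doc = new Document();   // org.apache.lucene.document.*
>>     // indexed for exact-match lookup, but with norms disabled
>>     doc.add(new Field("userField_42", "some value",
>>         Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
>>     // stored only, never indexed, so no norms either
>>     doc.add(new Field("rawPayload", "not searchable",
>>         Field.Store.YES, Field.Index.NO));
>>     writer.addDocument(doc);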
>> 
>> Mike
>> 
>> On Fri, Nov 5, 2010 at 1:37 PM, Mark Kristensson
>> <mark.kristensson@smartsheet.com> wrote:
>>> While most of our Lucene indexes are used for more traditional searching, this index
>>> in particular is used more like a reporting repository. Thus, we really do need to have
>>> that many fields indexed and they do need to be broken out into separate fields. There
>>> may be another way to structure the index to reduce the number of fields, but I'm hoping
>>> we can optimize the current design and avoid (yet another) index redesign.
>>> 
>>> I'll look into tweaking the merge policy, but I'm more interested in disabling norms
>>> because scoring really doesn't matter for us. Basically, we need nothing more than a
>>> binary answer from Lucene: either a record meets the provided criteria (which can be a
>>> rather complex boolean query with many subqueries) or it doesn't. If a record does match,
>>> we get the IDs from Lucene, go fetch the live data from our primary data store, and sort
>>> it (in Java) based upon criteria provided by the user, not by score.
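>>> 
>>> (A sketch of the kind of score-free collection that amounts to; the Collector below is illustrative rather than our exact code, and `searcher`/`query` are assumed to exist:)
>>> 
>>>     final List<Integer> matchingDocs = new ArrayList<Integer>();
>>>     searcher.search(query, new Collector() {
>>>       private int docBase;
>>>       public void setScorer(Scorer scorer) { /* scores are never used */ }
>>>       public void collect(int doc) { matchingDocs.add(docBase + doc); }
>>>       public void setNextReader(IndexReader reader, int docBase) { this.docBase = docBase; }
>>>       public boolean acceptsDocsOutOfOrder() { return true; }
>>>     });
>>>     // we then load our own stored ID field for each hit and sort outside Lucene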
>>> 
>>> After our initial design mushroomed in size, we redesigned and now (I thought) do not
>>> have norms on any of the fields in this index. So, I'm wondering if there was something
>>> in the results from the CheckIndex that I provided which indicates to you that we may
>>> have norms still enabled? I know that if you have norms on any one document's field,
>>> then any other document with that same field will get "infected" with norms as well.
>>> 
>>> My understanding is that any field that uses the constants Index.NOT_ANALYZED_NO_NORMS
>>> or Index.NO will not have norms on it, regardless of whether or not the field is stored.
>>> Is that not correct?
>>> 
>>> Thanks,
>>> Mark
>>> 
>>> 
>>> 
>>> On Nov 4, 2010, at 2:56 AM, Michael McCandless wrote:
>>> 
>>>> Likely what happened is you had a bunch of smaller segments, and then
>>>> suddenly they got merged into that one big segment (_aiaz) in your
>>>> index.
>>>> 
>>>> The representation for norms in particular is not sparse, so this
>>>> means the size of the norms file for a given segment will be
>>>> number-of-unique-indexed-fields X number-of-documents.
>>>> 
>>>> So this count grows quadratically on merge.
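>>>> 
>>>> (For example, at one byte per field per document, 300,000 unique indexed
>>>> fields x 6,000,000 documents works out to about 1.8e12 bytes, on the order
>>>> of 1.8 TB of norms for a single merged segment.)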
>>>> 
>>>> Do these fields really need to be indexed?   If so, it'd be better to
>>>> use a single field for all users for the indexable text if you can.
>>>> 
>>>> Failing that, a simple workaround is to set the maxMergeMB/Docs on the
>>>> merge policy; this'd prevent big segments from being produced.
>>>> Disabling norms should also work around this, though that will affect
>>>> hit scores...
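>>>> 
>>>> (A rough sketch of capping merge size, assuming the writer's default
>>>> LogByteSizeMergePolicy on 3.0.x; the thresholds are placeholders:)
>>>> 
>>>>     // keep merges from ever producing one huge segment
>>>>     LogByteSizeMergePolicy mp = (LogByteSizeMergePolicy) writer.getMergePolicy();
>>>>     mp.setMaxMergeMB(512.0);      // segments over 512 MB stop being merge candidates
>>>>     mp.setMaxMergeDocs(1000000);  // or cap by document count instead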
>>>> 
>>>> Mike
>>>> 
>>>> On Wed, Nov 3, 2010 at 7:37 PM, Mark Kristensson
>>>> <mark.kristensson@smartsheet.com> wrote:
>>>>> Yes, we do have a large number of unique field names in that index, because they are
>>>>> driven by user-named fields in our application (with some cleaning to remove illegal
>>>>> chars).
>>>>> 
>>>>> This slowness problem has appeared very suddenly in the last couple of weeks, but the
>>>>> number of unique field names has not spiked in that time. Have we crept over some
>>>>> threshold with our linear growth in the number of unique field names? Perhaps there is
>>>>> a limit driven by the amount of RAM in the machine that we are violating? Are there any
>>>>> guidelines for the maximum number, or suggested number, of unique field names in an
>>>>> index or segment? Any suggestions for potentially mitigating the problem?
>>>>> 
>>>>> Thanks,
>>>>> Mark
>>>>> 
>>>>> 
>>>>> On Nov 3, 2010, at 2:02 PM, Michael McCandless wrote:
>>>>> 
>>>>>> On Wed, Nov 3, 2010 at 4:27 PM, Mark Kristensson
>>>>>> <mark.kristensson@smartsheet.com> wrote:
>>>>>>> 
>>>>>>> I've run checkIndex against the index and the results are below. The net is that
>>>>>>> it's telling me nothing is wrong with the index.
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>>> I did not have any instrumentation around the opening of the IndexSearcher (we don't
>>>>>>> use an IndexReader), just around the actual query execution, so I had to add some
>>>>>>> additional logging. What I found surprised me: opening a search against this index
>>>>>>> takes the same 6 to 8 seconds that closing the IndexWriter takes.
>>>>>> 
>>>>>> IndexWriter opens a SegmentReader for each segment in the index, to
>>>>>> apply deletions, so I think this is the source of the slowness.
>>>>>> 
>>>>>> From the CheckIndex output, it looks like you have many (296,713)
>>>>>> unique field names on that one large segment -- does that sound
>>>>>> right?  I suspect such a very high field count is the source of the
>>>>>> slowness...
>>>>>> 
>>>>>> Mike


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

