lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: IndexWriter.close() performance issue
Date Sat, 20 Nov 2010 05:10:56 GMT
I actually think that the main reason for interning the field names in
Lucene is for comparison purposes and not to guarantee uniqueness (though
you get both). You will see many places in the Lucene's code where the field
name is compared using != operator instead of equals.

BTW, in your patch above, you use the hashCode as the key, which for Strings
is a bad idea especially if they are not all ASCII - Java's hashCode is
32-bit which is very bad -- you'll hit collisions very fast. I once counted
1M collisions on 100M ASCII words and even more when non-ASCII text was
involved.

Shai

On Sat, Nov 20, 2010 at 12:41 AM, Mark Kristensson <
mark.kristensson@smartsheet.com> wrote:

> My findings from the hprof results which showed 80% of the CPU time being
> in String.intern() led me to do some reading about String.intern() and what
> I found surprised me.
>
> First, there are some very strong feelings about String.intern() and its
> value. First, is this guy (
> http://www.codeinstructions.com/2009/01/busting-javalangstringintern-myths.html)
> who doesn't have much good to say about it. This guy (
> http://kohlerm.blogspot.com/2009/01/is-javalangstringintern-really-evil.html)
> responds to the key points, but acknowledges there are some issues with it.
> The net, to me, is that using String.intern() has some potential performance
> benefits IF you KNOW that the set of strings you are intern-ing will be
> small. If the set is large...watch out. The one element that has me
> especially worried is that intern-ing a String leads to it getting stored in
> PermGen space.
>
> From what I've read, the main reasons to intern a string are 1) comparison
> performance and 2) uniqueness (including having a single instance of the
> String). In looking at the source code for
> org.apache.lucene.util.StringHelper, StringInterner and SimpleStringInterner
> it appears that the primary reason for intern-ing strings in Lucene is for
> uniqueness (but that's just a guess). So, I decided to try tackling the
> uniqueness problem a different way: HashMap! I modified StringHelper to use
> a static instance of a HashMap instead of SimpleStringInterner just to see
> if it would make any difference.
>
> Much to my surprise, the change was very easy to make (see below) and (for
> the IndexReader.open) massively improved the performance of opening my
> large, problematic index. Instead of taking nearly 10 seconds to open the
> index on one of our lab servers, it takes only a couple of seconds for this
> index which has over 300,000 field names.
>
> Now, there are a lot of caveats to this test
> 1) Right now, I'm only testing a tiny little bit of even my own
> application's usage of Lucene, so I have no idea what the large impact of
> this change will be on my own app
> 2) I really am guessing at the reasons for intern-ing the Strings. I'm
> going to keep digging through source code to see where else StringHelper is
> used and how it is used, but I'm not going to be an expert on this code any
> time soon.
>
> Finally, this seems to address the issues I'm having opening IndexReaders
> against this index, but I have not been able to repro the
> IndexWriter.close() slowness that initially  led me down this path in my
> test app (though I can repro them our lab with our full app).
>
> Here's the changes I made to org.apache.lucene.util.StringHelper:
>
>  //public static StringInterner interner = new
> SimpleStringInterner(1024,8);
>  public static HashMap<Integer, String> uniqueStrings = new
> HashMap<Integer, String>();
>
>  /** Return the same string object for all equal strings */
>  public static String intern(String s) {
> //    return interner.intern(s);
>    int h = s.hashCode();
>    if (!uniqueStrings.containsKey(h)) {
>        uniqueStrings.put(h, s);
>    }
>    return uniqueStrings.get(h);
>  }
>
>
>
> On Nov 17, 2010, at 1:51 PM, Michael McCandless wrote:
>
> > Lucene interns field names... since you have a truly enormous number
> > of unique fields it's expected intern will be called alot.
> >
> > But that said it's odd that it's this costly.
> >
> > Can you post the stack traces that call intern?
> >
> > Mike
> >
> > On Fri, Nov 5, 2010 at 1:53 PM, Michael McCandless
> > <lucene@mikemccandless.com> wrote:
> >> Hmm...
> >>
> >> So, I was going on this output from your CheckIndex:
> >>
> >>   test: field norms.........OK [296713 fields]
> >>
> >> But in fact I just looked and that number is bogus -- it's always
> >> equal to total number of fields, not number of fields with norms
> >> enabled.  I'll open an issue to fix this, but in the meantime can you
> >> apply this patch to your CheckIndex and run it again?
> >>
> >> Index: src/java/org/apache/lucene/index/CheckIndex.java
> >> ===================================================================
> >> --- src/java/org/apache/lucene/index/CheckIndex.java    (revision
> 1031678)
> >> +++ src/java/org/apache/lucene/index/CheckIndex.java    (working copy)
> >> @@ -570,8 +570,10 @@
> >>       }
> >>       final byte[] b = new byte[reader.maxDoc()];
> >>       for (final String fieldName : fieldNames) {
> >> -        reader.norms(fieldName, b, 0);
> >> -        ++status.totFields;
> >> +        if (reader.hasNorms(fieldName)) {
> >> +          reader.norms(fieldName, b, 0);
> >> +          ++status.totFields;
> >> +        }
> >>       }
> >>
> >>       msg("OK [" + status.totFields + " fields]");
> >>
> >> So if in fact you have already disabled norms then something else is
> >> the source of the sudden slowness.  Though, such a huge number of
> >> unique field names is not an area of Lucene that's very well tested...
> >> perhaps there's something silly somewhere.  Maybe you can try
> >> profiling just the init of your IndexReader?  (Eg, run java with
> >> -agentlib:hprof=cpu=samples,depth=16,interval=1).
> >>
> >> Yes, both Index.NOT_ANALYZED_NO_NORMS and Index.NO will disable norms
> >> as long as no document in the index ever had norms on (yes it does
> >> "infect" heh).
> >>
> >> Mike
> >>
> >> On Fri, Nov 5, 2010 at 1:37 PM, Mark Kristensson
> >> <mark.kristensson@smartsheet.com> wrote:
> >>> While most of our Lucene indexes are used for more traditional
> searching, this index in particular is used more like a reporting
> repository. Thus, we really do need to have that many fields indexed and
> they do need to be broken out into separate fields. There may be another way
> to structure the index to reduce the number of fields, but I'm hoping we can
> optimize the current design and avoid (yet another) index redesign.
> >>>
> >>> I'll look into the tweaking the merge policy, but I'm more interested
> in disabling norms because scoring really doesn't matter for us. Basically,
> we need nothing more than a binary answer from Lucene: either a record meets
> the provided criteria (which can be a rather complex boolean query with many
> subqueries) or it doesn't. If the record does match, then we get the IDs
> from lucene and run off to get the live data from our primary data store and
> sort it (in Java) based upon criteria provided by the user, not by score.
> >>>
> >>> After our initial design mushroomed in size, we redesigned and now (I
> thought) do not have norms on any of the fields in this index. So, I'm
> wondering if there was something in the results from the CheckIndex that I
> provided which indicates to you that we may have norms still enabled? I know
> that if you have norms on any one document's field, then any other document
> with that same field will get "infected" with norms as well.
> >>>
> >>> My understanding is that any field that uses the constants
>  Index.NOT_ANALYZED_NO_NORMS or  Index.NO will not  have norms on it,
> regardless of whether or not the field is stored. Is that not correct?
> >>>
> >>> Thanks,
> >>> Mark
> >>>
> >>>
> >>>
> >>> On Nov 4, 2010, at 2:56 AM, Michael McCandless wrote:
> >>>
> >>>> Likely what happened is you had a bunch of smaller segments, and then
> >>>> suddenly they got merged into that one big segment (_aiaz) in your
> >>>> index.
> >>>>
> >>>> The representation for norms in particular is not sparse, so this
> >>>> means the size of the norms file for a given segment will be
> >>>> number-of-unique-indexed-fields X number-of-documents.
> >>>>
> >>>> So this count grows quadratically on merge.
> >>>>
> >>>> Do these fields really need to be indexed?   If so, it'd be better to
> >>>> use a single field for all users for the indexable text if you can.
> >>>>
> >>>> Failing that, a simple workaround is to set the maxMergeMB/Docs on the
> >>>> merge policy; this'd prevent big segments from being produced.
> >>>> Disabling norms should also workaround this, though that will affect
> >>>> hit scores...
> >>>>
> >>>> Mike
> >>>>
> >>>> On Wed, Nov 3, 2010 at 7:37 PM, Mark Kristensson
> >>>> <mark.kristensson@smartsheet.com> wrote:
> >>>>> Yes, we do have a large number of unique field names in that index,
> because they are driven by user named fields in our application (with some
> cleaning to remove illegal chars).
> >>>>>
> >>>>> This slowness problem has appeared very suddenly in the last couple
> of weeks and the number of unique field names has not spiked in the last few
> weeks. Have we crept over some threshold with our linear growth in the
> number of unique field names? Perhaps there is a limit driven by the amount
> of RAM in the machine that we are violating? Are there any guidelines for
> the maximum number, or suggested number, of unique fields names in an index
> or segment? Any suggestions for potentially mitigating the problem?
> >>>>>
> >>>>> Thanks,
> >>>>> Mark
> >>>>>
> >>>>>
> >>>>> On Nov 3, 2010, at 2:02 PM, Michael McCandless wrote:
> >>>>>
> >>>>>> On Wed, Nov 3, 2010 at 4:27 PM, Mark Kristensson
> >>>>>> <mark.kristensson@smartsheet.com> wrote:
> >>>>>>>
> >>>>>>> I've run checkIndex against the index and the results are
below.
> That net is that it's telling me nothing is wrong with the index.
> >>>>>>
> >>>>>> Thanks.
> >>>>>>
> >>>>>>> I did not have any instrumentation around the opening of
the
> IndexSearcher (we don't use an IndexReader), just around the actual query
> execution so I had to add some additional logging. What I found surprised
> me, opening a search against this index takes the same 6 to 8 seconds that
> closing the indexWriter takes.
> >>>>>>
> >>>>>> IndexWriter opens a SegmentReader for each segment in the index,
to
> >>>>>> apply deletions, so I think this is the source of the slowness.
> >>>>>>
> >>>>>> From the CheckIndex output, it looks like you have many (296,713)
> >>>>>> unique fields names on that one large segment -- does that sound
> >>>>>> right?  I suspect such a very high field count is the source
of the
> >>>>>> slowness...
> >>>>>>
> >>>>>> Mike
> >>>>>>
> >>>>>>
> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>>>
> >>>>>
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>>
> >>>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>
> >>>
> >>>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message