lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: NumericField indexing performance
Date Thu, 15 Apr 2010 12:15:14 GMT
"Read" means "re-add", the spell checker in my mail program :-)

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Uwe Schindler [mailto:uwe@thetaphi.de]
> Sent: Thursday, April 15, 2010 2:13 PM
> To: java-user@lucene.apache.org
> Subject: RE: NumericField indexing performance
> 
> Hi Tomislav,
> 
> when reading your mail it's not 100% clear what you did wrong, but I
> think the following occurred (so it's not a GC problem):
> 
> You reused the Document and NumericField instance in your original
> approach. But on each document you called doc.add(nf) again. That way,
> for each document you added the field one more time to the document, and
> after, say, a thousand docs you have the numeric field in there 1000 times
> and the indexer therefore indexes it 1000 times. After 2000 docs it's there
> 2000 times, so the time per document keeps growing and the total indexing
> time grows quadratically.
> 
> So when you reuse doc instances you have to do either:
> - Don’t modify the fields at all (and also add no more fields) and just
> set field values and add doc to writer
> - Clear the document and read fields
> 
> But don’t read fields without clearing! :-) That was your fault.
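
A minimal plain-Java sketch of the effect described above. It uses no Lucene classes: `ReusedDocumentModel` is a hypothetical stand-in for a reused Document, and "indexing" is modelled simply as touching every field currently held by the document. It only illustrates how the amount of work grows, not how Lucene is implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a reused document; this is not Lucene's Document class. */
final class ReusedDocumentModel {
    private final List<String> fields = new ArrayList<>();

    void add(String fieldName) { fields.add(fieldName); }
    void clear() { fields.clear(); }
    int fieldCount() { return fields.size(); }

    /** The mistake: the same field is added again for every document, never cleared. */
    static long fieldsIndexedWithReAdd(int docs) {
        ReusedDocumentModel doc = new ReusedDocumentModel();
        long indexed = 0;
        for (int i = 0; i < docs; i++) {
            doc.add("date");             // one more copy on every iteration
            indexed += doc.fieldCount(); // "indexing" touches every copy
        }
        return indexed;
    }

    /** Option 1: add the field once, then only change its value per document. */
    static long fieldsIndexedWithSingleAdd(int docs) {
        ReusedDocumentModel doc = new ReusedDocumentModel();
        doc.add("date");
        long indexed = 0;
        for (int i = 0; i < docs; i++) {
            indexed += doc.fieldCount();
        }
        return indexed;
    }

    /** Option 2: clear the document before re-adding the field. */
    static long fieldsIndexedWithClearAndReAdd(int docs) {
        ReusedDocumentModel doc = new ReusedDocumentModel();
        long indexed = 0;
        for (int i = 0; i < docs; i++) {
            doc.clear();
            doc.add("date");
            indexed += doc.fieldCount();
        }
        return indexed;
    }

    public static void main(String[] args) {
        System.out.println(fieldsIndexedWithReAdd(2000));         // 2001000
        System.out.println(fieldsIndexedWithSingleAdd(2000));     // 2000
        System.out.println(fieldsIndexedWithClearAndReAdd(2000)); // 2000
    }
}
```

In this model, 2000 documents cost about two million field visits with the re-add mistake, against 2000 with either of the two correct patterns.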
> 
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
> 
> > -----Original Message-----
> > From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> > Sent: Thursday, April 15, 2010 2:00 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: NumericField indexing performance
> >
> > Hi,
> >
> > I actually don't follow your change, because after the "but changing it
> > to" line the only different thing I see is the doc.add(dateField) call,
> > which you didn't list before "but changing it to".
> >
> > Also, if I understood Uwe correctly, he was suggesting reusing
> > NumericField instances, which means "new NumericField("date")" should
> > exist and be called only *once* in your code.  The same goes for
> > Document instances.  GC threads will thank you and Uwe for this change.
> >  Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Hadoop ecosystem search :: http://search-hadoop.com/
> >
> >
> >
> > ----- Original Message ----
> > From: Tomislav Poljak <tpoljak@gmail.com>
> > To: java-user@lucene.apache.org
> > Sent: Thu, April 15, 2010 7:41:02 AM
> > Subject: RE: NumericField indexing performance
> >
> > Hi Uwe,
> > thank you very much for your answers. I've done Document and
> > NumericField reuse like this:
> >
> > Document doc = getDocument();
> > NumericField dateField = new NumericField("date");
> >
> > for each doc:
> >
> > doc.add(dateField.setLongValue(Long.parseLong(
> >     DateTools.dateToString(date, DateTools.Resolution.MINUTE))));
> >
> > but changing it to:
> >
> > Document doc = getDocument();
> > NumericField dateField = new NumericField("date");
> > doc.add(dateField);
> >
> > for each doc:
> >
> > dateField.setLongValue(Long.parseLong(
> >     DateTools.dateToString(date, DateTools.Resolution.MINUTE)));
> >
> > did the trick. Now indexing with NumericField takes minutes, not
> > hours.
> >
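
A small plain-Java sketch of the long value that ends up in the "date" field in the snippets above. It does not call Lucene: it assumes that DateTools at MINUTE resolution produces a GMT string in the layout yyyyMMddHHmm, and the class and method names (`MinuteResolutionValue`, `toMinuteValue`) are made up for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper; mirrors the assumed yyyyMMddHHmm GMT layout, not Lucene code. */
final class MinuteResolutionValue {
    // Assumption: same digit layout as DateTools.Resolution.MINUTE, in UTC/GMT.
    private static final DateTimeFormatter MINUTE_FORMAT =
            DateTimeFormatter.ofPattern("uuuuMMddHHmm").withZone(ZoneOffset.UTC);

    /** Formats the instant down to the minute and parses the digits as a long. */
    static long toMinuteValue(Instant instant) {
        return Long.parseLong(MINUTE_FORMAT.format(instant));
    }

    public static void main(String[] args) {
        // Seconds are dropped by the format, so this prints 201004151215.
        System.out.println(toMinuteValue(Instant.parse("2010-04-15T12:15:14Z")));
    }
}
```

The resulting numbers are not a count of minutes, but they do sort in chronological order, which is what matters for range queries on the field.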
> > Thanks again,
> >
> > Tomislav
> >
> >
> >
> >
> >
> > On Wed, 2010-04-14 at 23:38 +0200, Uwe Schindler wrote:
> > > One addition:
> > > If you are indexing millions of numeric fields, you should also try to
> > > reuse NumericField and Document instances (as described in JavaDocs).
> > > NumericField creates internally a NumericTokenStream and lots of small
> > > objects (attributes), so GC cost may be high. This is just another idea.
> > >
> > > Uwe
> > >
> > > -----
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > http://www.thetaphi.de
> > > eMail: uwe@thetaphi.de
> > >
> > > > -----Original Message-----
> > > > From: Uwe Schindler [mailto:uwe@thetaphi.de]
> > > > Sent: Wednesday, April 14, 2010 11:28 PM
> > > > To: java-user@lucene.apache.org
> > > > Subject: RE: NumericField indexing performance
> > > >
> > > > Hi Tomislav,
> > > >
> > > > indexing with NumericField takes longer (at least for the default
> > > > precision step of 4, which means a 32-bit integer is turned into 8
> > > > subterms, one for each 4 bits of the value). So you produce 8 times
> > > > more terms during indexing that must be handled by the indexer. If you
> > > > have lots of documents with distinct values, the term index gets larger
> > > > and larger, but search performance increases dramatically (for
> > > > NumericRangeQueries). So if you index *only* numeric fields and nothing
> > > > else, 8 times slower indexing can be true.
> > > >
> > > > If you are not using NumericRangeQuery or you want to tune indexing
> > > > performance, try larger precision steps like 6 or 8. If you don’t use
> > > > NumericRangeQuery and only want to index the numeric terms as *one*
> > > > term, use precStep=Integer.MAX_VALUE. Also check your memory
> > > > requirements, as the indexer may need more memory and GC may cost too
> > > > much. The index size will also increase, so a lot more I/O is done.
> > > > Without more details I cannot say anything about your configuration. So
> > > > please tell us: how many documents, how many fields, and how many
> > > > numeric fields do you use, and in which configuration?
> > > >
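
The term counts mentioned above can be worked out with a few lines of plain Java. This is only a sketch of the arithmetic, assuming one term is produced per shift 0, precisionStep, 2*precisionStep, ... while the shift stays below the value's bit width; `PrecisionStepTerms` and `termsPerValue` are hypothetical names, not Lucene API.

```java
/** Hypothetical helper for the term-count arithmetic; not part of Lucene. */
final class PrecisionStepTerms {

    /** Number of terms for one numeric value of the given bit width. */
    static int termsPerValue(int valueBits, int precisionStep) {
        if (valueBits < 1 || precisionStep < 1) {
            throw new IllegalArgumentException("valueBits and precisionStep must be >= 1");
        }
        int terms = 0;
        // long shift avoids int overflow when precisionStep is Integer.MAX_VALUE
        for (long shift = 0; shift < valueBits; shift += precisionStep) {
            terms++;
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(termsPerValue(32, 4));                 // 8
        System.out.println(termsPerValue(32, 6));                 // 6
        System.out.println(termsPerValue(32, 8));                 // 4
        System.out.println(termsPerValue(64, 4));                 // 16
        System.out.println(termsPerValue(32, Integer.MAX_VALUE)); // 1
    }
}
```

Under that assumption, moving a 32-bit field from step 4 to step 8 halves the number of terms per value, and Integer.MAX_VALUE leaves a single term.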
> > > > Uwe
> > > >
> > > > -----
> > > > Uwe Schindler
> > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > http://www.thetaphi.de
> > > > eMail: uwe@thetaphi.de
> > > >
> > > > > -----Original Message-----
> > > > > From: Tomislav Poljak [mailto:tpoljak@gmail.com]
> > > > > Sent: Wednesday, April 14, 2010 8:13 PM
> > > > > To: java-user@lucene.apache.org
> > > > > Subject: NumericField indexing performance
> > > > >
> > > > > Hi,
> > > > > is it normal for indexing time to increase up to 10 times after
> > > > > introducing NumericField instead of Field (for two fields)?
> > > > >
> > > > > I've changed two date fields from String representation (Field) to
> > > > > NumericField, now it is:
> > > > >
> > > > > doc.add(new NumericField("time").setIntValue(date.getTime()/24/3600))
> > > > >
> > > > > and after this change indexing took 10x more time (before it was a few
> > > > > minutes and after more than an hour and a half). I've tested with a
> > > > > simple counter like this:
> > > > >
> > > > > doc.add(new NumericField("endTime").setIntValue(count++))
> > > > >
> > > > > but nothing changed, it still takes around 10x longer. If I comment out
> > > > > adding one numeric field to the index, time drops significantly, and if
> > > > > I comment out both fields, indexing takes only a few minutes again.
> > > > >
> > > > > Tomislav
> > > > >



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

