lucene-dev mailing list archives

From "Michael McCandless" <luc...@mikemccandless.com>
Subject Re: Post mortem kudos for (LUCENE-843) :)
Date Fri, 13 Jul 2007 12:13:42 GMT
"eks dev" <eksdev@yahoo.co.uk> wrote:

> > Was 24M (and not more) clearly the fastest performance?
> 
> No, this is kind of the optimum. Adding more memory up to 32M makes things
> slightly faster, at a slow rate, with the maximum at 32M.  After that things
> start getting slower (slowly).

Interesting.  This matches the experience Doron had where adding more
RAM actually slowed things down a bit (posted to
LUCENE-843).

> We are not yet completely done with tuning, especially with the two tips
> you mentioned in this mail.
> Fields are already reused, but

Super.

> 1. Reusing Document: there is one new Vector() in there (and at these
> speeds, something like this makes a difference!!!),
> in Document: List fields = new Vector(); (By the way, must this be a
> synchronized Vector? Why not ArrayList? Is there any difference?)

Oh yeah, it would be good to not "new Vector()" every time.

What I did in the benchmarking for LUCENE-843 was make a single
Document, make my N fields (using my own class that implements
Fieldable but lets me change the value), add these fields to the
Document, and then hold onto the fields as local variables (textField,
titleField, idField, etc.).

Then for each doc I just set the field values
(textField.setValue(...), etc.) and then call writer.addDocument(doc).

> 2. Reusing Field: excuse my ignorance, but how can I do it? With Document
> it is easy, with
> luceneDocument.add(field)
> luceneDocument.removeFields(name) // Wouldn't it be better to have
> luceneDocument.removeAllFields()?

Yeah it's not so easy now: Field.java does not have setters.

You have to make your own class that implements Fieldable (or
subclasses AbstractField) and adds your own setters.  Field.java is
also [currently] final so you can't subclass it.

In the benchmarking code (see patch in
http://issues.apache.org/jira/browse/LUCENE-947) I created a
ReusableStringField that lets you setStringValue(...).  You could use
that as your Field class.

Alternatively you can make a "ReusableStringReader" (there's one in
DocumentsWriter in the trunk now) and then use the normal Field class
but pass in your instance of ReusableStringReader.  This approach
could be faster if you implemented it to use a char[] instead of a
String (the current one in DocumentsWriter reads a String).
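
To make the char[] idea concrete, here is a minimal sketch of such a
reader.  It is not the ReusableStringReader from DocumentsWriter; the
class name ReusableCharReader and its init(...) method are made up for
illustration, and it relies only on java.io.Reader.

```java
import java.io.Reader;

/**
 * Hypothetical helper (not a Lucene class): a Reader over a char[]
 * that can be re-pointed at new content instead of being reallocated.
 */
final class ReusableCharReader extends Reader {
    private char[] buf = new char[0];
    private int pos; // index of the next char to return
    private int end; // one past the last valid char

    /** Point this reader at buf[offset .. offset + length) without allocating. */
    void init(char[] buf, int offset, int length) {
        this.buf = buf;
        this.pos = offset;
        this.end = offset + length;
    }

    @Override
    public int read(char[] dest, int off, int len) {
        if (len == 0) {
            return 0;
        }
        if (pos >= end) {
            return -1; // end of the current content
        }
        int n = Math.min(len, end - pos);
        System.arraycopy(buf, pos, dest, off, n);
        pos += n;
        return n;
    }

    @Override
    public void close() {
        // Nothing to release; the instance stays usable after the next init(...).
    }
}
```

The intended use would be to create one instance, pass it to the
Reader-based Field constructor once, and call init(...) with the next
document's characters before each writer.addDocument(doc).  I have not
benchmarked this variant, so treat any speedup as something to measure.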

> 3. "LUCENE-845": Whoops, I totally overlooked this one! And I am sure my
> maxBufferedDocs is well under what fits in 24 MB?!?  Any good tip on how
> to determine a good number: count added docs and see how far this number
> goes before flush() triggers (how do I detect when a flush by RAM gets
> triggered?) and then add 10% to this number...

Whoa, OK.

First you need to figure out how many docs are "typically" getting
flushed at 24 MB.  Easiest way would be to call
writer.setInfoStream(System.out) and look for the lines that say
"flush postings as segment XXX numDocs=YYY".  Likely your YYY is
"fairly" close every time since your docs are so predictable in size.

Then, set your maxBufferedDocs anywhere above YYY and below 10 * YYY
and you shouldn't hit LUCENE-845 (actually 5.5 * YYY is best, since it
is the midpoint of that range and so gives you the maximum safety
margin on both sides).  Note that you should call
setMaxBufferedDocs(...) first and then setRamBufferSizeMB(...).  If
you do it backwards then the writer will flush at exactly that number
of buffered docs instead.
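
If you capture the infoStream output, the arithmetic above can be
sketched roughly as follows.  This is only an illustration: the class
and method names are made up, and the line matching assumes the
"flush postings as segment XXX numDocs=YYY" wording quoted above and
nothing else about the log format.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Lucene) for picking maxBufferedDocs. */
final class MaxBufferedDocsEstimator {
    private static final Pattern NUM_DOCS = Pattern.compile("numDocs=(\\d+)");

    /** Average numDocs over the flush lines, or -1 if no flush line was seen. */
    static long averageFlushedDocs(List<String> infoStreamLines) {
        long total = 0;
        int flushes = 0;
        for (String line : infoStreamLines) {
            if (!line.contains("flush postings as segment")) {
                continue; // ignore unrelated infoStream output
            }
            Matcher m = NUM_DOCS.matcher(line);
            if (m.find()) {
                total += Long.parseLong(m.group(1));
                flushes++;
            }
        }
        return flushes == 0 ? -1 : total / flushes;
    }

    /** Midpoint of the safe range (YYY, 10 * YYY), i.e. 5.5 * YYY. */
    static int suggestMaxBufferedDocs(long typicalDocs) {
        return (int) Math.round(5.5 * typicalDocs);
    }
}
```

Feed the result of suggestMaxBufferedDocs(...) to
setMaxBufferedDocs(...) before you set the RAM buffer size.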

Mike


