lucene-dev mailing list archives

From "Michael McCandless" <luc...@mikemccandless.com>
Subject Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
Date Sat, 16 Feb 2008 10:59:59 GMT
Good question!

It didn't actually change when we flush: inside balanceRAM() we only
flush when numBytesUsed >= ramBufferSize.  So previously the 95% check
was costing an additional (unnecessary) synchronized method call once
we got to 95% of ramBufferSize.
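
To illustrate the point, here is a minimal single-threaded sketch of the
two guards.  It is not Lucene's actual code: the class
FlushThresholdSketch, the Counter type, and the simulate() helper are
hypothetical names made up for this example, and the byte sizes are
arbitrary.  It only assumes what is stated above, namely that
balanceRAM() is synchronized and flushes only once numBytesUsed >=
ramBufferSize.

```java
public class FlushThresholdSketch {

    /** Hypothetical stand-in for the writer's RAM accounting (not Lucene's real class). */
    static final class Counter {
        final long ramBufferSize;
        long numBytesUsed;
        int balanceCalls; // how often the synchronized method was entered
        int flushes;      // how often a flush actually happened

        Counter(long ramBufferSize) {
            this.ramBufferSize = ramBufferSize;
        }

        // Assumption taken from the message: a flush happens only when the buffer is full.
        synchronized void balanceRAM() {
            balanceCalls++;
            if (numBytesUsed >= ramBufferSize) {
                flushes++;
                numBytesUsed = 0; // pretend the buffered bytes were written out
            }
        }
    }

    /**
     * Adds `docs` documents of `docBytes` each (single-threaded) and calls
     * balanceRAM() behind either the old guard (> 95% of the buffer) or the
     * new guard (>= 100% of the buffer).
     */
    static Counter simulate(long ramBufferSize, long docBytes, int docs, boolean oldGuard) {
        Counter c = new Counter(ramBufferSize);
        for (int i = 0; i < docs; i++) {
            c.numBytesUsed += docBytes;
            boolean shouldCall = oldGuard
                    ? c.numBytesUsed > 0.95 * ramBufferSize
                    : c.numBytesUsed >= ramBufferSize;
            if (shouldCall) {
                c.balanceRAM();
            }
        }
        return c;
    }

    public static void main(String[] args) {
        Counter before = simulate(1000, 40, 100, true);
        Counter after = simulate(1000, 40, 100, false);
        System.out.println("old guard: calls=" + before.balanceCalls + " flushes=" + before.flushes);
        System.out.println("new guard: calls=" + after.balanceCalls + " flushes=" + after.flushes);
    }
}
```

With these made-up numbers both guards flush 4 times, but the old guard
enters the synchronized method 8 times and the new one only 4 times.
The flush points are identical; only the wasted calls between 95% and
100% go away.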

Mike

Felipe Albrecht <felipe.albrecht@gmail.com> wrote:
> Hello,
>
> I have a simple question about this patch.
>
> In the following patch segment, it is shown that the threshold for
> synchronizing the data changed.
>
>  if (ramBufferSize != IndexWriter.DISABLE_AUTO_FLUSH
> - && numBytesUsed > 0.95 * ramBufferSize)
> + && numBytesUsed >= ramBufferSize)
>  balanceRAM();
>
> Why was it changed, and might it not be influencing the timing results?
> In other words, it's saying "use more RAM before flushing", and doing
> larger and fewer flushes may be influencing the final time.
>
> I am a bit new to Lucene, only 2 weeks, but it caught my attention.
>
> Thank you,
>
> Felipe Albrecht
>
>
> On Feb 11, 2008 5:30 PM, Michael McCandless <lucene@mikemccandless.com> wrote:
> >
> >
> > Grant Ingersoll wrote:
> >
> > > Also, perhaps we should spin off another thread to discuss how to
> > > make DocsWriter easier to maintain.  My biggest concern is
> > > understanding how the various threads work together, and a few
> > > other areas but, like I said, let's spin up a separate thread to
> > > brainstorm what is needed.
> >
> > I agree we should work on simplifying it with time, and, spreading
> > the knowledge of how it works.
> >
> >
> > > Note that there is some risk in just using Wikipedia for profiling
> > > given its distribution of terms, etc.
> >
> > Good point.  Previously I was using Europarl, but, that corpus is
> > just too fast to index.
> >
> > Are you thinking Wikipedia is somewhat "dirty" (lots of extra terms
> > not normally seen with clean content)?  Since I'm using
> > StandardAnalyzer and not an analyzer based on the new
> > WikipediaTokenizer, I'm getting even extra terms.  Also, I think we'd
> > need an HTMLFilter in the chain since Wikipedia content uses HTML
> > markup.  Grant, what analyzer chain do you use when you index Wikipedia?
> >
> >
> > > I also wonder if using the LineDocMaker is all that realistic a
> > > profiling scenario.  While it is really useful in that it minimizes
> > > IO interaction, etc. I can't help but feel that it isn't at all
> > > close to typical usage.  Most users are not going to have all their
> > > docs rolled up into a single file, 1 doc per line, so I wonder if
> > > we potentially lose insight into how Lucene performs given that
> > > other issues like I/O/memory used for loading files may force the
> > > JVM/Lucene to not have the resources it needs.  Of course, I do
> > > know it is good to try to isolate things so we can focus just on
> > > Lucene, but we also should try to make some accounting for how it
> > > lives in the wild.
> >
> > I agree, this part is not realistic, and the intention is to measure
> > just the indexing time.  In fact I expect most apps spend quite a bit
> > more time building up a Document (filtering binary docs, etc) than
> > actually indexing it.  The only real-world app that I can think of
> > that would be close to LineDocMaker is using Lucene to search big log
> > files, where one line = one Document.
> >
> >
> > > Last, I think it would be good to always attach/check in the .alg
> > > file that is used when running the test, so that others can verify
> > > on different systems/configurations, etc.
> >
> > I did post the alg (under LUCENE-1172).  Though I see I forgot to
> > {code} it and it looks messed up now.  My recent test to try a single
> > quickSort(Object[]) was the same alg, just repeated 10 times instead
> > of 3.
> >
> > But I agree we should always post the alg for all tests...
> >
> >
> >
> >
> > Mike
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
