Grant Ingersoll wrote:
> Also, perhaps we should spin off another thread to discuss how to
> make DocsWriter easier to maintain. My biggest concern is
> understanding how the various threads work together, and a few
> other areas but, like I said, let's spin up a separate thread to
> brainstorm what is needed.
I agree we should work on simplifying it over time, and on spreading
the knowledge of how it works.
> Note, that there is some risk in just using Wikipedia for profiling
> given its distribution of terms, etc.
Good point. Previously I was using Europarl, but that corpus is
just too fast to index.
Are you thinking Wikipedia is somewhat "dirty" (lots of extra terms
not normally seen with clean content)? Since I'm using
StandardAnalyzer and not an analyzer based on the new
WikipediaTokenizer, I'm getting even more extra terms. Also, I think
we'd need an HTMLFilter in the chain since Wikipedia content uses HTML
markup. Grant, what analyzer chain do you use when you index Wikipedia?
> I also wonder if using the LineDocMaker is all that realistic a
> profiling scenario. While it is really useful in that it minimizes
> IO interaction, etc. I can't help but feel that it isn't at all
> close to typical usage. Most users are not going to have all their
> docs rolled up into a single file, 1 doc per line, so I wonder if
> we potentially lose insight into how Lucene performs given that
> other issues like I/O/memory used for loading files may force the
> JVM/Lucene to not have the resources it needs. Of course, I do
> know it is good to try to isolate things so we can focus just on
> Lucene, but we also should try to make some accounting for how it
> lives in the wild.
I agree, this part is not realistic, and the intention is to measure
just the indexing time. In fact I expect most apps spend quite a bit
more time building up a Document (filtering binary docs, etc) than
actually indexing it. The only real-world app that I can think of
that would be close to LineDocMaker is using Lucene to search big log
files, where one line = one Document.
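To make the "one line = one Document" idea concrete, here is a rough sketch in plain Java. It is not the actual LineDocMaker code: the LineDocSketch and LineDoc names are made up for illustration, and the tab-separated title/date/body layout is an assumption about the line file. The point is only that the per-document work is a single readLine plus a split, which is why this setup isolates indexing cost so well.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LineDocSketch {

    /** Hypothetical holder for one parsed line; not a Lucene class. */
    static final class LineDoc {
        final String title;
        final String date;
        final String body;

        LineDoc(String title, String date, String body) {
            this.title = title;
            this.date = date;
            this.body = body;
        }
    }

    /** Splits one line into title, date and body (assumed tab-separated). */
    static LineDoc parseLine(String line) {
        // Limit of 3 keeps any further tabs inside the body field.
        String[] parts = line.split("\t", 3);
        String title = parts.length > 0 ? parts[0] : "";
        String date = parts.length > 1 ? parts[1] : "";
        String body = parts.length > 2 ? parts[2] : "";
        return new LineDoc(title, date, body);
    }

    /** Reads every non-empty line from the reader as one document. */
    static List<LineDoc> readAll(BufferedReader in) throws IOException {
        List<LineDoc> docs = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines rather than emit empty docs
            }
            docs.add(parseLine(line));
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample standing in for a real line file.
        String sample = "First title\t2008-02-20\tbody of the first doc\n"
                      + "Second title\t2008-02-21\tbody of the second doc\n";
        List<LineDoc> docs = readAll(new BufferedReader(new StringReader(sample)));
        for (LineDoc d : docs) {
            System.out.println(d.title + " | " + d.date + " | " + d.body);
        }
    }
}
```

A real app would replace readAll with whatever fetches, filters and parses its source documents, and that step is usually where the time goes.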
> Last, I think it would be good to always attach/check in the .alg
> file that is used when running the test, so that others can verify
> on different systems/configurations, etc.
I did post the alg (under LUCENE-1172), though I see I forgot to
wrap it in {code} and it looks messed up now. My recent tests trying
a single quickSort(Object[]) used the same alg, just repeated 10 times
instead of 3.
But I agree we should always post the alg for all tests...
Mike