lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
Date Fri, 22 Sep 2006 19:20:23 GMT
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436980 ] 
            
Doron Cohen commented on LUCENE-675:
------------------------------------

A few things that would be nice to have in this performance package/framework:

() indexing only: overall time.
() indexing only: how indexing time changes as the index grows (indexing performance
may start to degrade beyond a certain index size).
() search by a single user while indexing
() search only, single user
() search only, concurrent users
() short queries
() long queries
() wild card queries
() range queries
() queries with rare words
() queries with common words
() tokenization/analysis only (the indexing measurements above include tokenization, but it
is important to be able to "prove" to oneself that tokenization/analysis time was not hurt
by a recent change).
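To make the "indexing time as the index grows" measurement concrete, here is a minimal sketch of per-batch timing. BatchTimer and its methods are hypothetical names for illustration, not part of Lucene or of any existing framework; the Runnable stands in for a batch of IndexWriter.addDocument() calls:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of per-batch indexing-time tracking: record the duration of each
 * batch so a slowdown at larger index sizes shows up in the per-batch numbers.
 * All names here are illustrative, not Lucene APIs.
 */
public class BatchTimer {
    private final List<Long> batchNanos = new ArrayList<>();

    /** Times one batch of indexing work and records its duration. */
    public void timeBatch(Runnable indexBatch) {
        long start = System.nanoTime();
        indexBatch.run();
        batchNanos.add(System.nanoTime() - start);
    }

    /** Number of batches recorded so far. */
    public int batchCount() {
        return batchNanos.size();
    }

    /** Duration of batch i in nanoseconds. */
    public long nanosFor(int i) {
        return batchNanos.get(i);
    }

    public static void main(String[] args) {
        BatchTimer timer = new BatchTimer();
        // Stand-in for indexing three successive batches of documents.
        for (int batch = 0; batch < 3; batch++) {
            timer.timeBatch(() -> { /* add one batch of documents here */ });
        }
        System.out.println("batches timed: " + timer.batchCount());
    }
}
```

Plotting nanosFor(i) against i would directly show whether indexing time per batch grows with index size.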

() parametric control over:
() () location of test input data.
() () location of output index.
() () location of output log/results.
() () total collection size (total number of bytes/characters read from the collection)
() () document (average) size (bytes/chars) - the test can break input data and recompose it
into documents of the desired size.
() () "implicit iteration size" - merge-factor, max-buffered-docs
() () "explicit iteration size" - how often the perf test calls
() () long queries text
() () short queries text
() () which parts of the test framework's capabilities to run
() () number of users / threads.
() () query pace - how many queries are fired in, say, a minute.
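The query-pace parameter above can be sketched as a fixed-rate loop with a sleep between queries. QueryPacer and its methods are hypothetical names, assuming a simple sleep-based pacing scheme rather than anything in Lucene:

```java
/**
 * Sketch of fixed-rate query pacing for a search benchmark: given a target
 * rate in queries per minute, compute the pause between queries and fire
 * them at that pace. Names are illustrative only.
 */
public class QueryPacer {
    /** Milliseconds to wait between queries for the given rate. */
    public static long pauseMillis(int queriesPerMinute) {
        if (queriesPerMinute <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        return 60_000L / queriesPerMinute;
    }

    /** Runs the query task 'count' times at the requested pace. */
    public static void run(Runnable queryTask, int queriesPerMinute, int count)
            throws InterruptedException {
        long pause = pauseMillis(queriesPerMinute);
        for (int i = 0; i < count; i++) {
            queryTask.run();
            if (i < count - 1) {
                Thread.sleep(pause);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // 120 queries/minute means one query every 500 ms.
        run(() -> System.out.println("query fired"), 120, 2);
    }
}
```

Running one such loop per thread would also cover the "number of users / threads" parameter.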

Additional points:
() It would help if all test-run parameters were maintained in a properties (or XML config)
file, so one can easily modify the test input/output without having to recompile the code.
() Output should allow easy creation of graphs - perhaps best would be a result
object, so others can easily extend it with additional output formats.
() index size as part of output.
() number of index files as part of output (?)
() an indexing input module that can loop over the input collection. This makes it possible
to test indexing of a collection larger than the actual input collection being used. 
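The looping input module in the last point can be sketched as a source that wraps around when it reaches the end of the collection. LoopingDocSource is a hypothetical name, not an existing Lucene class:

```java
import java.util.List;

/**
 * Sketch of an input module that loops over a small document collection so
 * the benchmark can index more documents than the collection contains.
 * Illustrative only; not part of any Lucene API.
 */
public class LoopingDocSource {
    private final List<String> docs;
    private int next = 0;

    public LoopingDocSource(List<String> docs) {
        if (docs.isEmpty()) {
            throw new IllegalArgumentException("empty collection");
        }
        this.docs = docs;
    }

    /** Returns the next document text, wrapping back to the start at the end. */
    public String nextDoc() {
        String doc = docs.get(next);
        next = (next + 1) % docs.size();
        return doc;
    }

    public static void main(String[] args) {
        LoopingDocSource src = new LoopingDocSource(List.of("doc-a", "doc-b"));
        // Draw four documents from a two-document collection.
        for (int i = 0; i < 4; i++) {
            System.out.println(src.nextDoc());
        }
    }
}
```

The indexing loop would then just pull nextDoc() until it has indexed the configured total collection size.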



> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying,
on a known corpus. This issue is intended to collect comments and patches implementing a suite
of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original
Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz
or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I
propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically
retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

