lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
Date Tue, 07 Nov 2006 10:50:55 GMT
     [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]

Doron Cohen updated LUCENE-675:
-------------------------------

    Attachment: timedata.zip

I tried it and it works nicely:
The 1st run downloaded the documents from the Web before starting to index.
The 2nd run started right away, since the input docs were already in place. Great.

Seems the only output is what is printed to stdout, right? 

I got something like this: 
----------------------------
     [echo] Working Directory: work
     [java] Testing 4 different permutations.
     [java] #-- ID: td-00_10_10, Sun Nov 05 22:40:49 PST 2006, heap=1065484288 --
     [java] # source=work\reuters-out, directory=org.apache.lucene.store.FSDirectory@D:\devoss\lucene\java\trunk\contrib\benchmark\work\index
     [java] # maxBufferedDocs=10, mergeFactor=10, compound=true, optimize=true
     [java] # Query data: R-reopen, W-warmup, T-retrieve, N-no
     [java] # qd-0110 R W NT [body:salomon]
     [java] # qd-0111 R W T [body:salomon]
     [java] # qd-0100 R NW NT [body:salomon]
...
     [java] # qd-14011 NR W T [body:fo*]
     [java] # qd-14000 NR NW NT [body:fo*]
     [java] # qd-14001 NR NW T [body:fo*]

     [java] Start Time: Sun Nov 05 22:41:38 PST 2006
     [java]  - processed 500, run id=0
     [java]  - processed 1000, run id=0
     [java]  - processed 1500, run id=0
     [java]  - processed 2000, run id=0
     [java] End Time: Sun Nov 05 22:41:48 PST 2006
     [java] warm = Warm Index Reader
     [java] srch = Search Index
     [java] trav = Traverse Hits list, optionally retrieving document

     [java] # testData id	operation	runCnt	recCnt	rec/s	avgFreeMem	avgTotalMem
     [java] td-00_100_100	addDocument	1	2000	472.0321	4493681	22611558
     [java] td-00_100_100	optimize	1	1	2.857143	4229488	22716416
     [java] td-00_100_100	qd-0110-warm	1	2000	40000.0	4250992	22716416
     [java] td-00_100_100	qd-0110-srch	1	1	Infinity	4221288	22716416
...
     [java] td-00_100_100	qd-4110-srch	1	1	Infinity	3993624	22716416
     [java] td-00_100_100	qd-4110-trav	1	0	NaN	3993624	22716416
     [java] td-00_100_100	qd-4111-warm	1	2000	50000.0	3853192	22716416
...
BUILD SUCCESSFUL
Total time: 1 minute 0 seconds
----------------------------

I think the "Infinity" and "NaN" values are caused by operations too short to time: the elapsed
time is 0, so the rate computation divides by zero.
This can be avoided by modifying getRate() in TimeData:
  public double getRate() {
    double rps = (double) count * 1000.0 / (double) (elapsed>0 ? elapsed : 1);
    return rps;
  }
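
To illustrate the effect, here is a minimal standalone sketch. RateDemo and its two methods are hypothetical stand-ins, not the actual TimeData class; the only assumption is that elapsed is measured in milliseconds, as the factor of 1000.0 in getRate() suggests.

```java
// Hypothetical stand-in for TimeData; only the rate arithmetic is modeled.
class RateDemo {
    /** Unguarded: records per second, dividing by the raw elapsed milliseconds. */
    static double unguardedRate(long count, long elapsedMs) {
        return (double) count * 1000.0 / (double) elapsedMs;
    }

    /** Guarded: an elapsed time of 0 ms is treated as 1 ms, as in the proposed getRate(). */
    static double guardedRate(long count, long elapsedMs) {
        return (double) count * 1000.0 / (double) (elapsedMs > 0 ? elapsedMs : 1);
    }

    public static void main(String[] args) {
        System.out.println(unguardedRate(1, 0)); // Infinity (positive value / 0.0)
        System.out.println(unguardedRate(0, 0)); // NaN (0.0 / 0.0)
        System.out.println(guardedRate(1, 0));   // 1000.0
        System.out.println(guardedRate(0, 0));   // 0.0
    }
}
```

The guard trades a meaningless value for a capped one: a sub-millisecond operation is reported as if it took 1 ms, so very fast operations are under-reported rather than shown as Infinity.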

I very much like the logic of loading test data from the Web, and the scaleUp and
maximumDocumentsToIndex params are handy.

It seems that all the test logic and some of its data (queries) are coded in Java. I initially
thought of a setup where we define parameterized tasks/jobs, like:

- createIndex(params)
- writeToIndex(params):
  - addDocs()
  - optimize()
- readFromIndex(params):
  - searchIndex()
  - fetchData()

...and compose a test with an XML file that says which of these simple jobs to run, with what
params, in which order, serial/parallel, how long/often, etc.
Creating a different test is then as easy as creating a different XML file that configures
that test.
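
A rough sketch of the task-composition idea, in Java only. Every name below (Task, Step, runSerial, sampleTest) is hypothetical and not part of the benchmark patch; the tasks are placeholders that record what they would do instead of touching an index, and the hard-coded step list stands in for whatever a parsed XML test definition would produce. Only the serial case is shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// All names here are hypothetical; none come from the benchmark patch.
class TaskCompositionDemo {
    /** A parameterized unit of benchmark work. */
    interface Task {
        void run(Map<String, String> params, List<String> log);
    }

    /** One step of a test: the task, its params, and how many times to run it. */
    static final class Step {
        final Task task;
        final Map<String, String> params;
        final int repeat;

        Step(Task task, Map<String, String> params, int repeat) {
            this.task = task;
            this.params = params;
            this.repeat = repeat;
        }
    }

    /** Builds a placeholder step that only records its name and params. */
    static Step step(String name, int repeat, Map<String, String> params) {
        Task recordOnly = (p, log) -> log.add(name + " " + p);
        return new Step(recordOnly, params, repeat);
    }

    /** Runs the steps one after another, in the given order. */
    static List<String> runSerial(List<Step> steps) {
        List<String> log = new ArrayList<>();
        for (Step s : steps) {
            for (int i = 0; i < s.repeat; i++) {
                s.task.run(s.params, log);
            }
        }
        return log;
    }

    /** Stands in for what a parsed XML test definition might produce. */
    static List<Step> sampleTest() {
        return List.of(
            step("createIndex", 1, Map.of()),
            step("addDocs", 2, Map.of("mergeFactor", "10")),
            step("optimize", 1, Map.of()),
            step("searchIndex", 3, Map.of("query", "body:salomon")));
    }

    public static void main(String[] args) {
        runSerial(sampleTest()).forEach(System.out::println);
    }
}
```

With this shape, a new test is just a different list of steps, which is the property the XML-driven setup is meant to provide.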

On the other hand, I know that chances are the most useful cases are those already defined
here (standard and micro-standard), so one can ask "why bother changing things to define
these building blocks?". I am not sure here, but thought I'd bring it up.

About using the driver: it seems nice and clean to me. I don't know the Digester, but it seems
to read the config from the XML correctly.

Other comments:
1. I think there is a redundant call to params.showRunData(params.getId()) in runBenchmark(File,Options).
2. It seems that rec/sec would be computed a bit more accurately by aggregating elapsed times
(instead of rates) in showRunData().
3. If TimeData is not found (only memData), I think an additional 0.0 should be printed.
4. Column alignment with tabs and floats is imperfect. :-)
5. It would be nice, I think, to also get a summary of the results by "task" (e.g. srch, optimize),
something like:
     [java] # testData id     operation           runCnt     recCnt          rec/s       avgFreeMem     avgTotalMem
     [java]                   warm                    60       2000       42,628.8        8,235,758      23,048,192
     [java]                   srch                   120          1          571.4        8,300,613      23,048,192
     [java]                   optimize                 1          1            2.9        9,375,732      23,048,192
     [java]                   trav                   120        107       30,517.8        8,326,046      23,048,192
     [java]                   addDocument              1       2000          441.8        7,310,929      22,206,872

The attached timedata.zip contains modified versions of TimeData.java and TestData.java that
address [1 to 5] above, as well as the NaN/Infinity issue.
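
Comments 2 and 5 can be illustrated together with a small sketch. It is not the code in timedata.zip: the class and method names are hypothetical, and it assumes that the task is the suffix after the last '-' in query operation names such as qd-0110-srch, and that elapsed times are in milliseconds. Per task, it sums record counts and elapsed times and derives rec/s from the totals, rather than averaging per-run rates.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the code attached to the issue.
class TaskSummaryDemo {
    /** Running totals for one task. */
    static final class Totals {
        long runs;
        long records;
        long elapsedMs;

        /** rec/s from aggregated totals, guarding against a zero elapsed time. */
        double rate() {
            return records * 1000.0 / (elapsedMs > 0 ? elapsedMs : 1);
        }
    }

    /** Assumed naming: "qd-0110-srch" maps to "srch"; other names are kept as-is. */
    static String taskOf(String operation) {
        return operation.startsWith("qd-")
            ? operation.substring(operation.lastIndexOf('-') + 1)
            : operation;
    }

    /** Groups parallel arrays of (operation, record count, elapsed ms) by task. */
    static Map<String, Totals> summarize(String[] ops, long[] counts, long[] elapsedMs) {
        Map<String, Totals> byTask = new LinkedHashMap<>();
        for (int i = 0; i < ops.length; i++) {
            Totals t = byTask.computeIfAbsent(taskOf(ops[i]), k -> new Totals());
            t.runs++;
            t.records += counts[i];
            t.elapsedMs += elapsedMs[i];
        }
        return byTask;
    }

    public static void main(String[] args) {
        // Made-up sample measurements, for illustration only.
        String[] ops = {"qd-0110-srch", "qd-0111-srch", "optimize"};
        long[] counts = {1, 1, 1};
        long[] elapsed = {1, 9, 350};
        summarize(ops, counts, elapsed).forEach((task, t) ->
            System.out.println(task + "\t" + t.runs + "\t" + t.records + "\t" + t.rate()));
    }
}
```

In the made-up sample, the two srch runs have individual rates of 1000 and about 111 rec/s, whose plain average is about 556, while the aggregated figure is 2 records in 10 ms, i.e. 200 rec/s. Averaging rates lets very short runs dominate the result; aggregating elapsed times weights each run by how long it actually took.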

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying,
> on a known corpus. This issue is intended to collect comments and patches implementing a suite
> of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original
> Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz
> or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I
> propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically
> retrieve it from known locations, and cache it locally.


