From Grant Ingersoll <>
Subject Re: contrib/benchmark questions
Date Mon, 19 Mar 2007 20:10:16 GMT
Thanks for the reply, Doron.  I knew this email was targeted for you,  
but thought it would be good to add to the user record.

On Mar 19, 2007, at 2:30 PM, Doron Cohen wrote:

> Grant Ingersoll <> wrote on 18/03/2007 10:16:14:
>> I'm using contrib/benchmark to do some tests for my ApacheCon talk
>> and have some questions.
>> 1. In looking at micro-standard.alg, it seems like not all braces are
>> closed.  Is a line ending a separator too?
> '>' can be used as an alternative closing character in place of
> either '}' or ']', with the semantics: "do not collect/report
> separate statistics for the contained tasks". See "Statistic
> recording elimination" in
> byTask/package-summary.html

So, if I am understanding correctly:

>> "SearchSameRdr" Search > : 5000

means don't collect individual stats for SearchSameRdr, but do whatever
that task does 5000 times, right?

>> 2. Is there any way to dump out what params are supported by the
>> various tasks?  I am esp. uncertain on the Search related tasks.
> Search related tasks do not take args. Perhaps the task should throw
> an exception if a param is set but not supported. I think I'll add that.
> Currently only AddDoc, DeleteDoc and SetProp take args. The section
> "Command parameter" in
> byTask/package-summary.html
> which describes this is incomplete - I will fix it to reflect that.
> Which query arguments do you have in mind?

Never mind, I was confused by the ": XXXX" parameters after the '>'.

>> 3. Is there any way to dump out the stats as a CSV file or something?
>> Would I implement a Task for this?  Ultimately, I want to be able to
>> create a graph in Excel that shows tradeoffs between speed and memory.
> Yes, implementing a report task would be the way.
> ... but when I look at how I implemented these reports, all the work
> is done in the class Points. It seems it should be modified a little,
> with more thought given to making reports easier to extend.

I may take a crack at it, but the deadline for the talk is looming.
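
Until a proper report task exists, one stopgap is to post-process the text report outside the benchmark. The sketch below is not part of contrib/benchmark: `ReportToCsv` and its methods are hypothetical names, and it assumes report columns are separated by runs of whitespace and that no single value contains a space. The real report layout may need extra handling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of contrib/benchmark: turns a
// whitespace-aligned text report into CSV for use in a spreadsheet.
final class ReportToCsv {

    // Converts one report line to CSV by splitting on runs of whitespace.
    // Assumption: no column value contains spaces. Cells that contain a
    // comma or a double quote are quoted so the CSV stays well-formed.
    static String toCsvLine(String reportLine) {
        String[] cells = reportLine.trim().split("\\s+");
        List<String> escaped = new ArrayList<>();
        for (String cell : cells) {
            if (cell.contains(",") || cell.contains("\"")) {
                cell = "\"" + cell.replace("\"", "\"\"") + "\"";
            }
            escaped.add(cell);
        }
        return String.join(",", escaped);
    }

    // Converts a whole report, skipping blank lines.
    static String toCsv(String report) {
        StringBuilder out = new StringBuilder();
        for (String line : report.split("\\R")) {
            if (!line.trim().isEmpty()) {
                out.append(toCsvLine(line)).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample text, only to show the shape of the conversion.
        String sample = "Operation  round  runCnt\n"
                      + "MAddDocs       0       1\n";
        System.out.print(toCsv(sample));
    }
}
```

Redirecting the benchmark's console output to a file and feeding the report section through `toCsv` would give something Excel can open directly.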

>> 4. Is there a way to set how many tabs occur between columns in the
>> final report?  The merge and buffer factors get hard to read for
>> larger values.
> There's no general tabbing control; it can be added if required. But
> for the automatically added columns this is not required - just modify
> the name of the column and it would fit, e.g. use "merge:10:100" to get
> a 5 character column, or "merging:10:100" for 7, etc. (Also see "Index
> work parameters" under "Benchmark properties" in
> byTask/package-summary.html)
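
To make the naming trick concrete: as I read the description, the first colon-separated field of such a property is the column name and the remaining fields are the values used in successive rounds, so the length of the name sets the column width. The sketch below only illustrates that reading; `RoundProperty` is a hypothetical class, not the benchmark's own parser.

```java
// Hypothetical illustration of a multi-valued property such as
// "merge:10:100": a column name followed by one value per round.
final class RoundProperty {
    final String columnName;
    final int[] valuesPerRound;

    RoundProperty(String spec) {
        String[] parts = spec.split(":");
        if (parts.length < 2) {
            throw new IllegalArgumentException("expected name:value[:value...], got " + spec);
        }
        columnName = parts[0];
        valuesPerRound = new int[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            valuesPerRound[i - 1] = Integer.parseInt(parts[i].trim());
        }
    }

    // Per the description above, the report column is as wide as its name.
    int columnWidth() {
        return columnName.length();
    }

    public static void main(String[] args) {
        for (String spec : new String[] {"merge:10:100", "merging:10:100"}) {
            RoundProperty p = new RoundProperty(spec);
            System.out.println(p.columnName + " -> width " + p.columnWidth());
        }
    }
}
```

So a longer name is the knob to turn when the values for later rounds outgrow the column.
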
>> 5. Below is my "alg" file, any tips?  What I am trying to do is show
>> the tradeoffs of merge factor and max buffered and how it relates to
>> memory and indexing time.  I want to process all the documents in the
>> Reuters benchmark collection, not the 2000 in the micro-standard.  I
>> don't want any pauses and for now I am happy doing things in serial.
>> I think it is doing what I want, but am not 100% certain.
> Yes, it seems correct to me. What I usually do to verify a new alg is
> to run it first with very small numbers - e.g. 10 instead of 22000,
> etc., and examine the log. A few comments:
> - You can specify a larger number than 22000 and the Docmaker will
> iterate and create new docs from the same input again.
> - Being interested in memory stats - the fact that all the rounds run
> in a single program, in the same JVM run, usually means what you see is
> very much dependent on the GC behavior of the specific VM you are
> using. If it does not release memory to the OS (most likely) you would
> not be able to notice that round i+1 used less memory than round i. It
> would probably be better for something like this to put the "round"
> logic in an ant script, invoking each round in a separate new exec.
> But then things get more complicated for having a final stats report
> containing all rounds. What do you think about this?

Good to know.  Perhaps a GarbageCollectionTask is needed?
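
A rough sketch of what such a task might do between rounds, using only the standard `Runtime` API. `HeapProbe` and its method names are hypothetical and nothing here is existing benchmark code. Note that `System.gc()` is only a request to the VM, so the numbers remain VM-dependent, which is exactly the caveat raised above.

```java
// Hypothetical sketch: sample used heap, optionally after asking the
// VM to collect garbage. Not part of contrib/benchmark.
final class HeapProbe {

    // Heap currently in use by this JVM, in bytes.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Requests a collection, then samples used heap. System.gc() is a
    // hint only; the VM is free to ignore it or collect partially.
    static long usedHeapAfterGc() {
        System.gc();
        return usedHeapBytes();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();

        // Allocate about 16 MB that becomes garbage once the reference is dropped.
        byte[][] garbage = new byte[16][];
        for (int i = 0; i < garbage.length; i++) {
            garbage[i] = new byte[1 << 20];
        }
        long peak = usedHeapBytes();
        garbage = null;

        long after = usedHeapAfterGc();
        System.out.println("before=" + before + " peak=" + peak + " afterGc=" + after);
    }
}
```

Calling something like `usedHeapAfterGc()` at the start of each round would at least give each round a comparable baseline, though it does not replace running every round in its own JVM.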

> - Seems you are only interested in the indexing performance, so you
> can remove (or comment out) the search part.
> - If you are also interested in the search part, note that as written,
> the four last search related tasks always use a new reader
> (opening/closing 950 readers in this test).

OK, search is the second part, just focused on indexing first.   
Trying to address common questions/issues people have with  
performance in these two areas.

So, I should wrap those tasks in an OpenReader/CloseReader?

We may also want to consider making this an XML based type  

Thanks for your help.  I will probably have a few more questions over  
the next few days.


