From: Grant Ingersoll <gsingers@apache.org>
Subject: Re: contrib/benchmark questions
Date: Mon, 19 Mar 2007 16:10:16 -0400
To: java-user@lucene.apache.org

Thanks for the reply, Doron.  I knew this email was targeted for you,  
but thought it would be good to add to the user record.

On Mar 19, 2007, at 2:30 PM, Doron Cohen wrote:

> Grant Ingersoll <gsingers@apache.org> wrote on 18/03/2007 10:16:14:
>
>> I'm using contrib/benchmark to do some tests for my ApacheCon talk
>> and have some questions.
>>
>> 1. In looking at micro-standard.alg, it seems like not all braces are
>> closed.  Is a line ending a separator too?
>
> '>' can be used as an alternative closing character in place of either
> '}' or ']', with the semantics: "do not collect/report separate
> statistics for the contained tasks". See "Statistic recording
> elimination" in
> http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/byTask/package-summary.html

So, if I am understanding correctly:

>> "SearchSameRdr" Search > : 5000

means don't collect individual stats for SearchSameRdr, but do whatever
that task does 5000 times, right?


>
>> 2. Is there any way to dump out what params are supported by the
>> various tasks?  I am esp. uncertain on the Search related tasks.
>
> Search related tasks do not take args. Perhaps the task should throw an
> exception if a param is set but not supported. I think I'll add that.
> Currently only AddDoc, DeleteDoc and SetProp take args. The section
> "Command parameter" in
> http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/byTask/package-summary.html
> which describes this is incomplete - I will fix it to reflect that.
>
> Which query arguments do you have in mind?

Never mind, I was confused by the : XXXX parameters after the >

>
>> 3. Is there any way to dump out the stats as a CSV file or something?
>> Would I implement a Task for this?  Ultimately, I want to be able to
>> create a graph in Excel that shows tradeoffs between speed and memory.
>
> Yes, implementing a report task would be the way.
> ... but when I look at how I implemented these reports, all the work is
> done in the class Points. It seems it should be modified a little, with
> more thought given to making it easier to extend reports.

I may take a crack at it, but the deadline for the talk is looming.


>
>> 4. Is there a way to set how many tabs occur between columns in the
>> final report?  The merge and buffer factors get hard to read for
>> larger values.
>
> There's no general tabbing control (it can be added if required), but
> for the automatically added columns it is not needed: just modify the
> name of the column and it will fit, e.g. use "merge:10:100" to get a
> 5-character column, or "merging:10:100" for 7, etc. (Also see "Index
> work parameters" under "Benchmark properties" in
> http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/byTask/package-summary.html)
>
>> 5. Below is my "alg" file, any tips?  What I am trying to do is show
>> the tradeoffs of merge factor and max buffered and how it relates to
>> memory and indexing time.  I want to process all the documents in the
>> Reuters benchmark collection, not the 2000 in the micro-standard.  I
>> don't want any pauses and for now I am happy doing things in serial.
>> I think it is doing what I want, but am not 100% certain.
>>
>
> Yes, it seems correct to me. What I usually do to verify a new alg is to
> run it first with very small numbers - e.g. 10 instead of 22000, etc. -
> and examine the log. A few comments:
> - You can specify a larger number than 22000 and the Docmaker will
> iterate and create new docs from the same input again.
> - Being interested in memory stats: the fact that all the rounds run in
> a single program, in the same JVM run, usually means that what you see
> is very much dependent on the GC behavior of the specific VM you are
> using. If it does not release memory to the OS (most likely), you would
> not be able to notice that round i+1 used less memory than round i. For
> something like this it would probably be better to put the "round" logic
> in an ant script, invoking each round in a separate new exec. But then
> things get more complicated for having a final stats report containing
> all rounds. What do you think about this?

Good to know.  Perhaps a GarbageCollectionTask is needed?
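
The "one JVM per round" idea can be sketched roughly as follows. Doron
suggested an ant script; this swaps in a small Python driver purely for
illustration. It assumes the benchmark driver class is
org.apache.lucene.benchmark.byTask.Benchmark and that it takes the alg
file as its only argument (check this against the package docs). The alg
file names and the classpath are hypothetical placeholders, with one alg
file per merge factor / max buffered combination.

```python
import subprocess

# Assumption: this is the byTask driver class and it takes the alg file
# as its single command-line argument.
BENCHMARK_MAIN = "org.apache.lucene.benchmark.byTask.Benchmark"


def round_command(alg_file, classpath, java="java"):
    """Build the command line that runs one round in its own JVM."""
    return [java, "-cp", classpath, BENCHMARK_MAIN, alg_file]


def run_rounds(alg_files, classpath, dry_run=True):
    """Run each alg file in a fresh JVM, one after the other.

    With dry_run=True the commands are only printed, not executed.
    """
    commands = [round_command(alg, classpath) for alg in alg_files]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # A new process per round means each round starts with a clean heap.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical file names and classpath; prints the commands only.
    run_rounds(
        ["round-mrg10-buf10.alg", "round-mrg100-buf100.alg"],
        classpath="build/classes",
    )
```

Because every round starts from a fresh heap, memory numbers from one
round no longer depend on what earlier rounds left behind. The cost is
the one Doron mentions: each run prints its own report, so a combined
table across rounds has to be assembled separately.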


> - Seems you are only interested in the indexing performance, so you can
> remove (or comment out) the search part.
> - If you are also interested in the search part, note that as written,
> the last four search related tasks always use a new reader
> (opening/closing 950 readers in this test).

OK, search is the second part, just focused on indexing first.   
Trying to address common questions/issues people have with  
performance in these two areas.

So, I should wrap those tasks in an OpenReader/CloseReader pair?

We may also want to consider making this an XML based type  
configuration...

Thanks for your help.  I will probably have a few more questions over  
the next few days.

-Grant
