lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <>
Subject [jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
Date Tue, 31 Jul 2007 22:07:55 GMT


Doron Cohen commented on LUCENE-967:

Applies cleanly and all tests pass (running from contrib/benchmark).

I like the efficiency changes.

A few suggestions:

  1) In ReadTokensTask, change doLogic() to return the number of tokens 
       processed in that specific call to doLogic() (this differs from 
       tokensCount, which aggregates over all calls).

  2) In TestPerfTaskLogic, the comment in testReadTokens seems to be 
      copy/pasted from testLineDocFile and should be changed. 

      - Also (I am not sure it is worth your time, but) to really test it, you 
      could open a reader against the created index and verify the number 
      of docs, and also compare the index's sum-of-DF to the total token 
      count reported by ReadTokensTask. 
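
To illustrate suggestion 1, here is a minimal, self-contained sketch of the per-call versus aggregate distinction. It is not the code in the patch: ReadTokensSketch, its doLogic(List<String>) signature, and getTotalTokenCount() are hypothetical names, and a whitespace split stands in for reading a TokenStream.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for ReadTokensTask; not the contrib/benchmark class.
public class ReadTokensSketch {
    // Aggregate over all calls (the role tokensCount plays in the comment above).
    private int totalTokenCount = 0;

    // Returns only the number of tokens processed by this call.
    public int doLogic(List<String> fieldValues) {
        int tokensThisCall = 0;
        for (String value : fieldValues) {
            // Assumption: whitespace splitting stands in for a real TokenStream.
            String trimmed = value.trim();
            if (!trimmed.isEmpty()) {
                tokensThisCall += trimmed.split("\\s+").length;
            }
        }
        totalTokenCount += tokensThisCall;
        return tokensThisCall;
    }

    public int getTotalTokenCount() {
        return totalTokenCount;
    }

    public static void main(String[] args) {
        ReadTokensSketch task = new ReadTokensSketch();
        int first = task.doLogic(Arrays.asList("a b c", "d e"));
        int second = task.doLogic(Arrays.asList("f g"));
        // Prints "5 2 7": two per-call counts, then the aggregate.
        System.out.println(first + " " + second + " " + task.getTotalTokenCount());
    }
}
```

Returning the per-call count lets the caller report work done by each invocation, while the aggregate stays available separately for an end-of-run total.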

> Add "tokenize documents only" task to contrib/benchmark
> -------------------------------------------------------
>                 Key: LUCENE-967
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>         Attachments: LUCENE-967.patch, LUCENE-967.take2.patch
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *
