lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections
Date Tue, 01 Feb 2011 07:18:29 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988917#comment-12988917 ]

Doron Cohen edited comment on LUCENE-1540 at 2/1/11 7:18 AM:
-------------------------------------------------------------

Updated patch:
* The entire doc is first read into docBuf, then parsed by trecParser.
* Added TREC parser impls for LATimes, FT, FBIS and FR94, so the whole Trec-Disks-4+5-minus-CR collection is covered.
* Added a parser by path, which selects the parser to use according to the path of the input file.
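
To illustrate the parser-by-path idea, here is a minimal sketch. It is not the code in the patch: the class name, the enum, and the rule of matching on the nearest enclosing directory name (case-insensitive) are assumptions made for the example.

```java
import java.util.Locale;

public class ParserByPathSketch {

    /** Hypothetical stand-ins for the per-collection parsers listed above. */
    public enum ParserKind { LATIMES, FT, FBIS, FR94, DEFAULT }

    /**
     * Picks a parser kind from the path of an input file.
     * Assumption of this sketch: the collection is identified by the name of
     * an enclosing directory, and the nearest matching directory wins.
     */
    public static ParserKind selectParser(String path) {
        // Normalize Windows separators so both path styles split the same way.
        String[] parts = path.replace('\\', '/').split("/");
        // Skip the last element (the file name) and walk up the directories.
        for (int i = parts.length - 2; i >= 0; i--) {
            String dir = parts[i].toUpperCase(Locale.ROOT);
            for (ParserKind kind : ParserKind.values()) {
                if (kind != ParserKind.DEFAULT && kind.name().equals(dir)) {
                    return kind;
                }
            }
        }
        // No known collection directory found on the path.
        return ParserKind.DEFAULT;
    }

    public static void main(String[] args) {
        // Example paths are made up for illustration.
        System.out.println(selectParser("/data/trec/latimes/la010189.gz"));
        System.out.println(selectParser("C:\\trec\\FR94\\01\\fr940104.0"));
    }
}
```

Falling back to a default parser, rather than failing, is a design choice of the sketch: files outside the four known dirs would still be read with a generic parser.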

Still not ready to commit but almost there.

With this patch, the following alg would index all 4 dirs, each with its own TREC parser:

{code}
# ----- properties
content.source=org.apache.lucene.benchmark.byTask.feeds.TrecContentSource
content.source.verbose=false
content.source.excludeIteration=true
doc.maker.forever=false
docs.dir=<my-in-dir>
trec.doc.parser=org.apache.lucene.benchmark.byTask.feeds.TrecParserByPath
content.source.log.step=2500
doc.term.vector=false
content.source.forever=false
directory=FSDirectory
work.dir=<my-result-index-dir>
doc.stored=true
doc.body.stored=false
doc.tokenized=true
# ----- alg
ResetSystemErase
CreateIndex
{ AddDoc > : *
CloseIndex
RepAll
{code}

I am thinking of making TrecDocParser an abstract class and moving into it some of the functionality
currently in TrecContentSource / TrecParserByPath.

Also thinking of serving each input file to a separate thread, likely in a separate issue. I think
this would allow better performance when someone indexes in multiple threads: with the current
synchronization (we sync on reading from the file), I got the fastest indexing when running
sequentially, compared against 2, 3 and 4 threads.
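
A rough sketch of the file-per-thread idea, under stated assumptions: in-memory strings stand in for input files, counting `<DOC>` markers stands in for parsing and indexing, and all class and method names are hypothetical. The point is only that each task owns its input, so no lock is needed on reading.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FilePerThreadSketch {

    /** Counts "<DOC>" markers in one input; a stand-in for parsing one file. */
    public static int countDocs(String fileContent) {
        int count = 0;
        int from = 0;
        while ((from = fileContent.indexOf("<DOC>", from)) >= 0) {
            count++;
            from += "<DOC>".length();
        }
        return count;
    }

    /**
     * Submits one task per input, so no two threads ever read the same input
     * and no synchronization on reading is required.
     */
    public static int processAll(List<String> inputs, int numThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (final String input : inputs) {
                Callable<Integer> task = () -> countDocs(input);
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // waits for that task to finish
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> inputs = Arrays.asList("<DOC>a</DOC><DOC>b</DOC>", "<DOC>c</DOC>");
        System.out.println(processAll(inputs, 2));
    }
}
```

Whether this beats sequential indexing in practice would still need measuring; the sketch only shows the shape of the change.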

      was (Author: doronc):
    Updated patch:
* The entire doc is first read into docBuf, then parsed by trecParser.
* Added TREC parser impls for LATimes, FT, FBIS and FR94, so the whole Trec-Disks-4+5-minus-CR collection is covered.
* Added a parser by path, which selects the parser to use according to the path of the input file.

Still not ready to commit but almost there.

With this patch, the following alg would index all 4 dirs, each with its own TREC parser:

{code}
        "# ----- properties ",
        "content.source=org.apache.lucene.benchmark.byTask.feeds.TrecContentSource",
        "content.source.verbose=false",
        "content.source.excludeIteration=true",
        "doc.maker.forever=false",
        "docs.dir=" + inDir.getAbsolutePath().replace('\\','/'), 
        "trec.doc.parser=org.apache.lucene.benchmark.byTask.feeds.TrecParserByPath",
        "content.source.log.step=2500",
        "doc.term.vector=false", 
        "content.source.forever=false",
        "directory="+(outDir==null?"RAMDirectory":"FSDirectory"), 
        (outDir==null ? "# --- no work dir " : "work.dir="+outDir.getAbsolutePath().replace('\\','/')),

        "doc.stored=true", 
        "doc.body.stored=false",
        "doc.tokenized=true",
        "# ----- alg ", 
        "ResetSystemErase", 
        "CreateIndex", 
        "{ AddDoc > : *",
        "CloseIndex",
        "RepAll",
{code}

I am thinking of making TrecDocParser an abstract class and moving into it some of the functionality
currently in TrecContentSource / TrecParserByPath.

Also thinking of serving each input file to a separate thread, likely in a separate issue. I think
this would allow better performance when someone indexes in multiple threads: with the current
synchronization (we sync on reading from the file), I got the fastest indexing when running
sequentially, compared against 2, 3 and 4 threads.
  
> Improvements to contrib.benchmark for TREC collections
> ------------------------------------------------------
>
>                 Key: LUCENE-1540
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1540
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Tim Armstrong
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-1540.patch, LUCENE-1540.patch
>
>
> The benchmarking utilities for TREC test collections (http://trec.nist.gov) are quite limited and do not support some of the variations in format of older TREC collections.
> I have been doing some benchmarking work with Lucene and have had to modify the package to support:
> * Older TREC document formats, which the current parser fails on due to missing document headers.
> * Variations in query format - newlines after <title> tag causing the query parser to get confused.
> * Ability to detect and read in uncompressed text collections
> * Storage of document numbers by default without storing full text.
> I can submit a patch if there is interest, although I will probably want to write unit tests for the new functionality first.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

