lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections
Date Tue, 01 Feb 2011 23:42:33 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-1540:
--------------------------------

    Attachment: trecdocs.zip
                LUCENE-1540.patch

Updated patch for 3x.

To apply it, also copy the attached trecdocs.zip under
lucene/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds

A test case was added which reads all 5 TREC formats, with a mix of txt/bz2/gz file formats.
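
For illustration only (this is not code from the patch): reading a mix of txt/bz2/gz inputs could start by picking the decoder from the file name suffix. The class and method names below are hypothetical, and the sketch assumes Java 7+ for java.nio.file. The JDK has no bzip2 decoder, so that branch only reports that a third-party one is needed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper; not part of Lucene or of the attached patch. */
public final class FeedFormatSketch {
    public enum Compression { NONE, GZIP, BZIP2 }

    private FeedFormatSketch() {}

    /** Guesses the compression from the file name suffix; anything else is treated as plain text. */
    public static Compression detect(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".gz")) return Compression.GZIP;
        if (lower.endsWith(".bz2")) return Compression.BZIP2;
        return Compression.NONE;
    }

    /** Opens the file, wrapping it in a decompressing stream when the suffix calls for one. */
    public static InputStream open(Path file) throws IOException {
        InputStream raw = Files.newInputStream(file);
        switch (detect(file.getFileName().toString())) {
            case GZIP:
                try {
                    return new GZIPInputStream(raw);
                } catch (IOException e) {
                    raw.close(); // do not leak the file handle on a bad gzip header
                    throw e;
                }
            case BZIP2:
                raw.close();
                // Assumption: a real implementation would plug in a third-party bzip2 decoder here.
                throw new UnsupportedOperationException("bzip2 needs a third-party decoder");
            default:
                return raw;
        }
    }
}
```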

I moved unzip() from the backcompat test to LuceneTestCase and fixed it to also handle
directory hierarchies.
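
As a rough sketch of what unzipping with a directory hierarchy involves (this is not the unzip() from the patch; the class name is hypothetical and the code assumes Java 7+ for java.nio.file): each entry's parent directories are created before the entry is written, and entries that would land outside the destination are rejected.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper; not the LuceneTestCase method. */
public final class UnzipSketch {
    private UnzipSketch() {}

    /** Extracts every entry of the zip stream under destDir, creating subdirectories as needed. */
    public static void unzip(InputStream zipped, Path destDir) throws IOException {
        Path root = destDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        try (ZipInputStream zin = new ZipInputStream(zipped)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path target = root.resolve(entry.getName()).normalize();
                // Reject entries that would escape destDir (for example "../x").
                if (!target.startsWith(root)) {
                    throw new IOException("Entry outside target dir: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    // A file entry may appear without a separate entry for its parent directory.
                    Files.createDirectories(target.getParent());
                    // Copies until the end of the current entry; does not close zin.
                    Files.copy(zin, target, StandardCopyOption.REPLACE_EXISTING);
                }
                zin.closeEntry();
            }
        }
    }
}
```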

TrecDocParser is now an abstract class, as discussed, and Shai's other suggestions were
also followed.
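
To show the general shape of an abstract parser with per-format subclasses (a sketch under assumptions, not the TrecDocParser API): all names, the field map, and the tag choices below are hypothetical and only illustrate the pattern of shared helpers in the base class and format-specific logic in each subclass.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical base class; does not mirror the real TrecDocParser signatures. */
public abstract class DocParserSketch {

    /** Parses one raw document into a field-name to value map. */
    public abstract Map<String, String> parse(String rawDoc);

    /** Shared helper: trimmed text between the first <tag> and its </tag>, or null if absent. */
    protected static String extract(String raw, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        int start = raw.indexOf(open);
        if (start < 0) return null;
        start += open.length();
        int end = raw.indexOf(close, start);
        if (end < 0) return null;
        return raw.substring(start, end).trim();
    }

    /** One hypothetical concrete format that only knows DOCNO and TEXT. */
    public static final class SimpleParser extends DocParserSketch {
        @Override
        public Map<String, String> parse(String rawDoc) {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("docno", extract(rawDoc, "DOCNO"));
            fields.put("body", extract(rawDoc, "TEXT"));
            return fields;
        }
    }
}
```

Supporting another format would then mean adding a subclass rather than adding branches to a single parser.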

Planning to commit tomorrow if there are no reservations.

> Improvements to contrib.benchmark for TREC collections
> ------------------------------------------------------
>
>                 Key: LUCENE-1540
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1540
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Tim Armstrong
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, trecdocs.zip
>
>
> The benchmarking utilities for TREC test collections (http://trec.nist.gov) are quite limited and do not support some of the variations in format of older TREC collections.
> I have been doing some benchmarking work with Lucene and have had to modify the package to support:
> * Older TREC document formats, which the current parser fails on due to missing document headers.
> * Variations in query format - newlines after <title> tag causing the query parser to get confused.
> * Ability to detect and read in uncompressed text collections
> * Storage of document numbers by default without storing full text.
> I can submit a patch if there is interest, although I will probably want to write unit tests for the new functionality first.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

