Improvements to contrib.benchmark for TREC collections
------------------------------------------------------
Key: LUCENE-1540
URL: https://issues.apache.org/jira/browse/LUCENE-1540
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Tim Armstrong
Priority: Minor
The benchmarking utilities for TREC test collections (http://trec.nist.gov) are quite limited
and do not handle some of the format variations found in older TREC collections.
I have been doing some benchmarking work with Lucene and have had to modify the package to
support:
* Older TREC document formats, which the current parser fails on because document headers are missing.
* Variations in query format, such as a newline after the <title> tag, which confuses the
query parser.
* Detecting and reading uncompressed text collections.
* Storing document numbers by default without storing the full text.
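
As a rough illustration of the second and third points (this is not the patch itself), the
sketch below shows one way to read a topic title that may start on the line after the <title>
tag, and one way to tell a gzip-compressed collection file from an uncompressed one by
checking for the gzip magic bytes. The class name TrecInputSketch and both method names are
made up for this example and do not exist in contrib/benchmark; the topic layout is assumed
to follow the usual <top>/<num>/<title>/<desc> tagging.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper; these names are illustrative and not part of contrib/benchmark. */
public class TrecInputSketch {

    // The title runs from <title> up to the next tag (or end of input).
    // DOTALL lets the match span newlines, so a line break after <title> is harmless.
    private static final Pattern TITLE =
        Pattern.compile("<title>(.*?)(?=<|\\z)", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the title with whitespace collapsed, or null if no <title> tag is present. */
    public static String extractTitle(String topic) {
        Matcher m = TITLE.matcher(topic);
        if (!m.find()) {
            return null;
        }
        return m.group(1).replaceAll("\\s+", " ").trim();
    }

    /** Wraps the stream in a GZIPInputStream only if it begins with the gzip magic bytes. */
    public static InputStream openMaybeGzipped(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);                 // remember the start so the peeked bytes can be re-read
        int b0 = in.read();
        int b1 = in.read();
        in.reset();                 // rewind to the marked position
        boolean gzipped = (b0 == 0x1f && b1 == 0x8b);
        return gzipped ? new GZIPInputStream(in) : in;
    }
}
```

Peeking at the first two bytes avoids relying on the file extension, which is not always a
reliable indicator for older collections that have been unpacked or repackaged by hand.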
I can submit a patch if there is interest, although I will probably want to write unit tests
for the new functionality first.