lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections
Date Wed, 02 Feb 2011 09:01:29 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989565#comment-12989565 ]

Doron Cohen commented on LUCENE-1540:
-------------------------------------

Thanks for reviewing, Shai!

bq. Maybe instead of moving the unzip method to LuceneTestCase, you can put it as a static
method in _TestUtil? Also, _TestUtil already has a rmDir method, I think we should use it?
I would also do the same for fullTempDir.
Good point, will do.

bq. The method pathType(File f) in TrecDocParser – maybe instead of walking up the path
elements you can obtain its full absolute path (which is a String) and then do indexOf() checks
for the 4 types? It will simplify matters IMO.
I'm not sure yet whether I prefer this file-separator-sensitive approach; I'll take a look.
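
For illustration, a minimal sketch of the indexOf() idea, not the code in the patch. The class name and the type names other than GOV2 are placeholders, and normalizing the separators first is one assumed way to address the file-separator concern:

```java
import java.io.File;
import java.util.Locale;

final class PathTypeSketch {
    // Hypothetical type names, for illustration only (not the patch's list).
    enum PathType { GOV2, FBIS, FT, LATIMES }

    /** Returns the first type whose lower-cased name is a directory element of the path, or null. */
    static PathType pathType(File f) {
        // Normalize separators so the check behaves the same on Windows and Unix.
        String path = f.getAbsolutePath().replace('\\', '/').toLowerCase(Locale.ROOT);
        for (PathType t : PathType.values()) {
            // Surround the name with separators so "ft" does not match inside "software".
            String element = "/" + t.name().toLowerCase(Locale.ROOT) + "/";
            if (path.indexOf(element) >= 0) {
                return t;
            }
        }
        return null; // no known type found in the path
    }
}
```

Matching on "/name/" rather than the bare name avoids false hits on substrings, at the cost of only recognizing the type when it appears as a directory element.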

bq. Typo in TDP: unmodofied --> unmodified.
Will fix.

bq. Maybe we can use String.replaceAll() which takes a regex? This is not critical ...
Right, much simpler this way, will do!

bq. Does stripTags strip the tags off the HTML content? Or is it used for the TREC types other
than GOV2?
It strips any tags, but it is used by the parsers that do not use the HTML parser; that is,
the Gov2 one does not use it.
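
As a rough sketch of regex-based tag stripping with String.replaceAll(), assuming a plain String in and out (the class name, the signature and the whitespace cleanup are assumptions, not the patch's actual stripTags):

```java
final class StripTagsSketch {
    /** Replaces anything that looks like a tag with a space, then tidies the whitespace. */
    static String stripTags(String text) {
        // "<[^>]*>" matches a '<', any run of characters other than '>', then '>'.
        String noTags = text.replaceAll("<[^>]*>", " ");
        // Collapse the runs of whitespace left behind and trim the ends.
        return noTags.replaceAll("\\s+", " ").trim();
    }
}
```

A regex like this is only suitable for simple SGML-style markup; it does not handle a '>' inside an attribute value or comments, which is one reason to keep a real HTML parser for HTML content.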

bq. In TrecContentSource, can you replace TrecParserByPath.pathType with TrecDocParser.pathType?
Good catch, this is part of older code, will do.

bq. Also, do we still need TrecParserByPath? I don't see that it's used.
Yes, we do: this is an important addition in this patch, allowing you to index TREC docs of
several formats. It is used, but dynamically, through the algorithm in TrecContentSourceTest.testTrecFeedDirAllTypes().
So removing it will not break compilation, but it will make the tests fail.
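
To show why a class that is only loaded by name is invisible to the compiler, here is a generic reflection sketch. All names in it are hypothetical, and it is not how the benchmark package actually wires up its parsers:

```java
final class DynamicParserSketch {
    /** Stand-in for a parser interface (hypothetical). */
    interface DocParser {
        String name();
    }

    /** Stand-in for a parser that is referenced only by its class name. */
    static final class ParserByPath implements DocParser {
        public String name() {
            return "by-path";
        }
    }

    /** Instantiates a parser from a class name, as a configuration-driven loader might. */
    static DocParser load(String className) {
        try {
            return (DocParser) Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // A removed or misspelled class surfaces only here, at run time.
            throw new IllegalStateException("Cannot load parser " + className, e);
        }
    }
}
```

Because the class name is just a string, deleting the class leaves no unresolved reference at compile time; the failure appears only when something, such as a test, actually asks for it.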

> Improvements to contrib.benchmark for TREC collections
> ------------------------------------------------------
>
>                 Key: LUCENE-1540
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1540
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Tim Armstrong
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, trecdocs.zip
>
>
> The benchmarking utilities for TREC test collections (http://trec.nist.gov) are quite
> limited and do not support some of the variations in format of older TREC collections.
> I have been doing some benchmarking work with Lucene and have had to modify the package
> to support:
> * Older TREC document formats, which the current parser fails on due to missing document
>   headers.
> * Variations in query format - newlines after <title> tag causing the query parser
>   to get confused.
> * Ability to detect and read in uncompressed text collections
> * Storage of document numbers by default without storing full text.
> I can submit a patch if there is interest, although I will probably want to write unit
> tests for the new functionality first.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

