lucene-dev mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: TrecContentSource and docname/iteration number
Date Fri, 13 Nov 2009 03:54:10 GMT
I think the fix you've made makes sense. The iteration number is added in
case you want to collect more documents than are available, so that the
source starts over with the first one. I don't think it has to do with the
iterations option in Benchmark, although it could.

Being able to configure it makes sense to me. What's the default? I
personally wouldn't mind if the default were to omit the iteration number.

BTW, we could decide not to make it configurable, and have the code add
_<iter> to the names only when there is a second iteration. The names would
then be DOCID0001 in the first iteration and DOCID0001_0 (or _1) in the second.
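
A minimal sketch of that idea, not the actual TrecContentSource code. The
class and method names here are hypothetical, and it assumes the iteration
counter starts at 0 and that the suffix on the second pass is _1:

```java
// Hypothetical helper for illustration only; not part of Lucene.
class DocNameSketch {

    /**
     * Builds the docname for a document.
     * Assumption: iteration is 0 on the first pass over the collection.
     */
    static String docName(String docno, int iteration) {
        if (iteration == 0) {
            // First pass: keep the DOCNO untouched so it matches the qrels.
            return docno;
        }
        // Later passes: append the iteration so repeated docs get distinct names.
        return docno + "_" + iteration;
    }

    public static void main(String[] args) {
        System.out.println(docName("DOCID0001", 0));
        System.out.println(docName("DOCID0001", 1));
    }
}
```

With this, the first pass yields DOCID0001 and the second pass yields
DOCID0001_1, so a single-pass index stays usable with the quality package.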

Shai

On Thu, Nov 12, 2009 at 8:53 PM, Robert Muir <rcmuir@gmail.com> wrote:

> If I use TrecContentSource to index a collection, it puts the doc name into
> the docname field, just as I like.
> Say I have a doc with
> <DOCNO>DOCID0001</DOCNO>
> The problem is that it concatenates the iteration number to this document
> name:
>
> name = name + "_" + iteration;
>
> this produces a docname of DOCID0001_0, which won't work if I am trying to
> use the quality package to measure relevance.
>
> Does anyone object to changing TrecContentSource to *not do this* ???
> I would think the primary reason you would want to use it would be to
> measure relevance.
>
> Alternatively, we could change DocNameExtractor in the quality package to
> ignore this _<iteration> suffix. It doesn't matter to me.
> --
> Robert Muir
> rcmuir@gmail.com
>
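
For reference, the alternative mentioned in the quoted message (ignoring the
suffix on the reading side) could look roughly like the sketch below. This is
not the DocNameExtractor API; the class and method are hypothetical, and it
assumes the suffix is always an underscore followed by decimal digits:

```java
// Hypothetical helper for illustration only; not part of Lucene.
class IterationSuffixSketch {

    /** Strips a trailing "_<digits>" suffix, if present; otherwise returns the name unchanged. */
    static String stripIterationSuffix(String docName) {
        int idx = docName.lastIndexOf('_');
        // No underscore, or nothing after it: there is no suffix to strip.
        if (idx < 0 || idx == docName.length() - 1) {
            return docName;
        }
        // Only strip when everything after the last underscore is a digit.
        for (int i = idx + 1; i < docName.length(); i++) {
            if (!Character.isDigit(docName.charAt(i))) {
                return docName;
            }
        }
        return docName.substring(0, idx);
    }

    public static void main(String[] args) {
        System.out.println(stripIterationSuffix("DOCID0001_0"));
    }
}
```

One caveat with this approach: a DOCNO that itself ends in an underscore and
digits would be truncated too, which is an argument for not adding the suffix
in the first place.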
