lucene-general mailing list archives

From Simon Willnauer <>
Subject Re: Open Relevance Project?
Date Wed, 13 May 2009 20:38:59 GMT
I have followed the whole discussion on this thread about how to obtain
a corpus of documents. I personally think that we should first define
WHAT kind of corpus, or rather what different kinds of corpora, should
be included in this new OpenRelevance project, and not HOW the corpus
is collected / aggregated. IR is not just about having a huge corpus of
full-text documents / web pages, especially when it comes to ranking.

My understanding of OpenRelevance is that it should provide a set of
corpora and measurement procedures for various use cases, not just
compete with TREC. Please correct me if I'm wrong.
Beyond that, the project should help to improve Lucene ranking itself,
or at least help to obtain a measurement reference for more than just
web search.

Anyway, I personally feel that the discussion about how to obtain a
certain corpus is out of scope at this stage of the project.


On Wed, May 13, 2009 at 9:13 PM, Grant Ingersoll <> wrote:
> On May 13, 2009, at 2:48 PM, Ted Dunning wrote:
>> Crawling a reference dataset requires essentially one-time bandwidth.
> True, but we will likely evolve over time to have multiple datasets, but no
> reason to get ahead of ourselves.
>> Also, it is possible to download, say, wikipedia in a single go.
> Wikipedia isn't always that interesting from a relevance testing standpoint,
> for IR at least (QA, machine learning, etc. it is more so).  A lot of
> queries simply have only one or two relevant results.  While that is useful,
> it is not often the whole picture of what one needs for IR.
>> Likewise there are various web-crawls that are available for research
>> purposes (I think).  See for one example.  These would be single
>> downloads.
>> I don't entirely see the point of redoing the spidering.
> I think we have to be able to control the spidering, so that we can say
> we've vetted what's in it, due to copyright, etc.  But, maybe not.  I've
> talked with quite a few people who have corpora available, and it always
> comes down to copyright for redistribution in a public way.  No one wants to
> assume the risk, even though they all crawl and redistribute (for money).
> For instance, the Internet Archive even goes so far as to apply robots.txt
> retroactively.  We probably could do the same thing, but I'm not sure if it
> is necessary.
