lucene-general mailing list archives

From Michael McCandless <>
Subject Re: Open Relevance Project?
Date Mon, 11 May 2009 20:01:02 GMT
I'd love to see a resource like this (it's high time!), and I'll try
to help when/where I can, starting with some initial

I think it's actually quite a challenge to do well.  E.g., it's easy
to make a corpus that's too easy because it's highly diverse (and thus
most search engines have no trouble pulling back relevant results).
Instead, I think the content set should be well and tightly scoped to
a certain topic, and not necessarily that large (i.e., we don't need a
huge number of documents).  It would help to scope toward content that
many people find "of interest", so we get "accurate" judgements from
as wide an audience as possible.

E.g., how about coverage of the 2009 H1N1 outbreak (that's licensed
appropriately)?  Or... the 2008 US presidential election?  Or...
research on Leukemia (but I fear such content is not typically
licensed appropriately, nor will it have wide interest).

What does "using Nutch to crawl Creative Commons" actually mean?  Can
I browse the content that's being crawled?

Also, to help us build up the relevance judgements, I think we should
build a basic custom app for collecting queries as well as annotating
them.  I should be able to go to that page and run my own queries,
which are collected.  Then, I should be able to browse previously
collected queries, click on them, and add my own judgement.  The site
should try to offer up queries that are "in need" of judgements.  It
should run the search and let me step through the results, marking
those that are relevant.  That would bias the judgements toward
whichever search engine produced the results, though; maybe under the
hood we rotate through search engines each time?
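
To make the idea above a bit more concrete, here is a minimal sketch
of what the collection side might look like.  Everything in it is an
assumption for illustration: the class name JudgmentCollector, the
engine names, the in-memory storage and the "target number of
judgements per query" threshold are all hypothetical, not an agreed
design.

```python
from collections import defaultdict
from itertools import cycle


class JudgmentCollector:
    """Hypothetical in-memory store for queries and relevance judgements."""

    def __init__(self, engines, target_judgments=3):
        # engines: names of the search backends to rotate through (assumed).
        self._engines = cycle(engines)
        # A query is "in need" until it has this many judgements (assumed rule).
        self.target_judgments = target_judgments
        self.queries = []  # submitted queries, in submission order
        # query -> doc_id -> list of (user, relevant) pairs
        self.judgments = defaultdict(lambda: defaultdict(list))

    def add_query(self, query):
        """Record a user-submitted query once, keeping submission order."""
        if query not in self.queries:
            self.queries.append(query)

    def next_engine(self):
        """Round-robin over engines so no single ranking dominates."""
        return next(self._engines)

    def add_judgment(self, query, doc_id, user, relevant):
        """Store one user's relevant / not-relevant mark for a document."""
        self.add_query(query)
        self.judgments[query][doc_id].append((user, bool(relevant)))

    def judgment_count(self, query):
        """Total number of judgements recorded for a query."""
        return sum(len(marks) for marks in self.judgments[query].values())

    def queries_in_need(self):
        """Queries below the target, least-judged first (ties keep order)."""
        needy = [q for q in self.queries
                 if self.judgment_count(q) < self.target_judgments]
        return sorted(needy, key=self.judgment_count)
```

Each search request would call next_engine() to pick the backend, and
the "browse" page would list queries_in_need() first.  Simple rotation
only spreads the bias across engines; it does not remove it.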

Do we have anyone involved who's built similar corpora before?  Or has
anyone read papers on how prior corpora were designed/created?


On Mon, May 11, 2009 at 12:07 PM, Grant Ingersoll <> wrote:
> A few of us who are interested in an Open Relevance assessment project (ala
> TREC) have started to put some thoughts down on "paper" over at
> Thus, if you'd like to somehow participate (TBD what that actually means
> just yet) in developing a set of open collections, queries and assessments
> for relevance testing, let's discuss here and on that Wiki page.
> The basic gist of it is, we'd like to crawl Creative Commons and/or other
> free content, redistribute it along with queries and judgments, thus fueling
> the testing capabilities to further improve Lucene's search quality as well
> as, of course, providing the means for a completely open assessment process
> whereby anyone can participate without having to fork up money to license 20
> year old copyrighted news articles that are of no other value whatsoever
> other than testing.
> At this point, we're open to a lot of ideas.  Once we solidify a bit, then
> we'd like to make it an official Lucene subproject and get our own resources
> as well as figure out how to crawl and host the content using ASF
> infrastructure (without making the ASF infra. team upset!)
> Cheers,
> Grant
