lucene-general mailing list archives

From Andrzej Bialecki
Subject Re: Open Relevance Project?
Date Mon, 11 May 2009 20:46:20 GMT
Michael McCandless wrote:

> I think it's actually quite a challenge to do well.  EG it's easy to
> make a corpus that's too easy because it's highly diverse (and thus
> most search engines have no trouble pulling back relevant results).
> Instead, I think the content set should be well & tightly scoped to a
> certain topic, and not necessarily that large (ie we don't need a huge
> number of documents).  It would help if that scoping is towards
> content that many people find "of interest" so we get "accurate"
> judgements by as wide an audience as possible.
> EG how about coverage of the 2009 H1N1 outbreak (that's licensed
> appropriately)?  Or... the 2008 US presidential election?  Or...
> research on Leukemia (but I fear such content is not typically
> licensed appropriately, nor will it have wide interest).

These are good ideas. It's difficult not only to collect a meaningful 
corpus, but also to distribute it later if it weighs a hundred GB or more.

> What does "using Nutch to crawl Creative Commons" actually mean?  Can
> I browse the content that's being crawled?

Yes. It's easy to collect a lot of web pages starting from a seed list 
and expanding the crawling frontier to linked resources, while applying 
CC license filters. Nutch provides a lot of tools out of the box that we 
need anyway, such as keeping track of page status, following outlinks, 
parsing, working with the web graph (important for scoring web documents), 
indexing, searching and content browsing.
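
To make the crawl loop described above concrete, here is a minimal sketch in Python. It is not Nutch code and uses no Nutch API: the toy link graph, the license tags, and the function names are all hypothetical, and skipping the outlinks of non-CC pages is a simplifying assumption rather than a statement about how a real crawl would be configured.

```python
from collections import deque

# Hypothetical toy web: URL -> (license tag, outlinks). Illustrative data only.
TOY_WEB = {
    "http://a.example/": ("cc-by", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("all-rights-reserved", ["http://d.example/"]),
    "http://c.example/": ("cc-by-sa", ["http://a.example/", "http://d.example/"]),
    "http://d.example/": ("cc-by", []),
}


def is_cc_licensed(license_tag):
    """Assumed license filter: accept any tag that starts with 'cc-'."""
    return license_tag.startswith("cc-")


def crawl(seeds, web, max_pages=100):
    """Breadth-first crawl from a seed list, keeping only CC-licensed pages."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, to avoid refetching
    kept = []                 # pages that passed the license filter
    while frontier and len(kept) < max_pages:
        url = frontier.popleft()
        page = web.get(url)
        if page is None:
            continue          # unknown URL, nothing to fetch
        license_tag, outlinks = page
        if not is_cc_licensed(license_tag):
            continue          # assumption: drop the page and its outlinks
        kept.append(url)
        for link in outlinks:  # expand the frontier to linked resources
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return kept


if __name__ == "__main__":
    print(crawl(["http://a.example/"], TOY_WEB))
```

In this toy graph the crawl starting at `http://a.example/` keeps pages a, c and d, and drops b because its tag fails the filter.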

> Also, to help us build up the relevance judgements, I think we should
> build a basic custom app for collecting queries as well as annotating
> them.  I should be able to go to that page and run my own queries,
> which are collected.  Then, I should be able to browse previously
> collected queries, click on them, and add my own judgement.  The site
> should try to offer up queries that are "in need" of judgements.  It
> should run the search and let me step through the results, marking
> those that are relevant; but we would then bias the results to that
> search engine; maybe under the hood we rotate through search engines
> each time?

Comparing results across search engines is clearly a challenge. Among 
other things, this requires that the corpus we use with the engines 
we operate (Lucene? KinoSearch? other open source engines?) contains at 
least the top-X (where X > N) URLs returned from external engines for every 
query; otherwise we won't be able to compare the results.
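
As a rough illustration of that coverage requirement, the check below reports, per query, which of the top-X URLs returned by external engines are absent from our corpus. The function name, the input layout (query -> engine -> ranked URL list), and the sample data are assumptions made for this sketch only.

```python
def missing_from_corpus(corpus_urls, external_results, x):
    """Return {query: sorted top-x external URLs that the corpus lacks}.

    external_results maps query -> {engine name: ranked list of URLs}.
    Queries whose top-x results are fully covered are left out.
    """
    corpus = set(corpus_urls)
    gaps = {}
    for query, per_engine in external_results.items():
        needed = set()
        for ranked_urls in per_engine.values():
            needed.update(ranked_urls[:x])  # only the top-x of each engine
        missing = sorted(needed - corpus)
        if missing:
            gaps[query] = missing
    return gaps


if __name__ == "__main__":
    # Hypothetical sample data; engine names and URLs are placeholders.
    corpus = ["http://a.example/", "http://b.example/"]
    results = {
        "h1n1 symptoms": {
            "engine1": ["http://a.example/", "http://z.example/"],
            "engine2": ["http://b.example/", "http://a.example/"],
        },
        "flu vaccine": {
            "engine1": ["http://a.example/"],
        },
    }
    print(missing_from_corpus(corpus, results, x=2))
```

A query that shows up in the returned mapping cannot yet be compared fairly, since some externally ranked documents are unavailable to the engines we operate.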

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
