lucene-general mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Open Relevance Project?
Date Wed, 13 May 2009 19:13:49 GMT

On May 13, 2009, at 2:48 PM, Ted Dunning wrote:

> Crawling a reference dataset requires essentially one-time bandwidth.
>

True, but we will likely evolve over time to have multiple datasets;  
no reason to get ahead of ourselves, though.


> Also, it is possible to download, say, wikipedia in a single go.

Wikipedia isn't always that interesting from a relevance-testing  
standpoint, at least for IR (for QA, machine learning, etc. it is  
more so).  A lot of queries simply have only one or two relevant  
results.  While that is useful, it is often not the whole picture of  
what one needs for IR.

> Likewise
> there are various web-crawls that are available for research  
> purposes (I
> think).  See http://webascorpus.org/ for one example.  These would  
> be single
> downloads.
>
> I don't entirely see the point of redoing the spidering.

I think we have to be able to control the spidering, so that we can  
say we've vetted what's in it with respect to copyright, etc.  But  
maybe not.  I've talked with quite a few people who have corpora  
available, and it always comes down to the copyright risk of  
redistributing in a public way.  No one wants to assume that risk,  
even though they all crawl and redistribute (for money).

For instance, the Internet Archive even goes so far as to apply  
robots.txt retroactively.  We probably could do the same thing, but  
I'm not sure if it is necessary.
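To make the idea concrete, here is a minimal sketch of what "applying robots.txt retroactively" could look like: filter an already-crawled URL list against each site's current robots.txt rules. The URLs, hosts, and robots.txt content below are hypothetical examples, and this uses Python's standard urllib.robotparser rather than anything the Internet Archive actually runs.

```python
# Sketch: retroactively apply robots.txt rules to an existing crawl.
# All URLs and robots.txt bodies here are made-up examples.
from urllib import robotparser
from urllib.parse import urlparse

def filter_crawl(urls, robots_by_host, user_agent="*"):
    """Drop any crawled URL that the host's current robots.txt forbids."""
    parsers = {}
    for host, robots_txt in robots_by_host.items():
        rp = robotparser.RobotFileParser()
        rp.parse(robots_txt.splitlines())
        parsers[host] = rp
    kept = []
    for url in urls:
        rp = parsers.get(urlparse(url).netloc)
        # No robots.txt on record for this host: keep the page.
        if rp is None or rp.can_fetch(user_agent, url):
            kept.append(url)
    return kept

crawl = [
    "http://example.com/index.html",
    "http://example.com/private/report.html",
]
robots = {"example.com": "User-agent: *\nDisallow: /private/\n"}
print(filter_crawl(crawl, robots))
```

Running this keeps only the page that the Disallow rule does not cover; re-fetching each site's robots.txt periodically and re-running the filter is what makes the policy "retroactive."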

