lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Apache logs and data
Date Tue, 20 Nov 2007 19:28:09 GMT
karl wettin wrote:
> On Nov 15, 2007 10:09 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>> it is always good to have query logs
> 
> I realize that it is not that politically correct, but the TPB
> collection is released to the public domain and contains 3.2 million
> user queries with session id, timestamp, category etc to go with the
> 150,000+500,000 documents.
> 
> 
> http://thepiratebay.org/tor/3783572

That's a good find!  They use Lucene too!

I don't see any legal issues with us writing code that parses these
files.  To be safest, I don't think we should republish the files, or
even any of the queries, but I don't think we need to.  Folks can
download them to their own machines and use them for testing there.

It doesn't look as though there's click data, so we can't use this for 
relevance experiments without manually creating judgments.  But for 
performance benchmarking it could be useful.

Doug



