lucene-dev mailing list archives

From Grant Ingersoll
Subject Re: [jira] Commented: (LUCENE-836) Benchmarks Enhancements (precision/recall, TREC, Wikipedia)
Date Tue, 20 Mar 2007 13:30:56 GMT
I think the Reuters corpus is pretty good and it is pretty well known in
the community.  Probably the most important part would be to build up
a set of judgments.  I don't think it is too hard to come up w/
50-100 questions/queries, but creating the relevance pool will be
more difficult.  I suppose we could set up a social networking site to
harvest judgments... :-)
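
To make the role of the judgments concrete, here is a minimal sketch of how precision and recall could be computed for a single query once a judged set exists. The class name, method names, and document IDs are hypothetical and are not part of the benchmark contrib; a real TREC-style evaluation would also average over all queries.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
public class PrecisionRecall {

    // Count how many retrieved documents appear in the judged-relevant set.
    private static int relevantRetrieved(List<String> retrieved, Set<String> relevant) {
        int hits = 0;
        for (String id : retrieved) {
            if (relevant.contains(id)) {
                hits++;
            }
        }
        return hits;
    }

    // Precision: fraction of the retrieved documents that are relevant.
    public static double precision(List<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0;
        }
        return (double) relevantRetrieved(retrieved, relevant) / retrieved.size();
    }

    // Recall: fraction of the relevant documents that were retrieved.
    public static double recall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) relevantRetrieved(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        // Made-up document IDs: a ranked result list and the judged-relevant set.
        List<String> retrieved = Arrays.asList("d1", "d7", "d3", "d9");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d5"));
        System.out.println("precision = " + precision(retrieved, relevant));
        System.out.println("recall    = " + recall(retrieved, relevant));
    }
}
```

The sketch assumes each retrieved document ID appears only once in the result list. The relevant set is the expensive part: it has to come from human judgments over a pool of candidate documents per query.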

The 4M queries would be good for load testing.

Wikipedia stuff is good, but you need to be able to handle/remove the  
redirects, otherwise you have a tendency to get redirect pages as  
your top matches due to length normalization.  Plus it is really big  
to download.
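
As an illustration of the redirect problem: a redirect page contains only a few tokens, so length normalization (in Lucene's default similarity, roughly 1/sqrt(number of terms in the field)) scores it highly. A minimal sketch of filtering such pages before indexing follows. It assumes the raw wikitext is available and relies on the MediaWiki convention that an English redirect page begins with `#REDIRECT [[Target]]`; the class and method names are hypothetical, and localized redirect keywords on non-English wikis are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene or LUCENE-826.
public class RedirectFilter {

    // Matches "#REDIRECT [[Target]]" at the start of the text, case-insensitively.
    // The capture group stops at "]", "|" or "#", so section anchors and
    // link labels are excluded from the target title.
    private static final Pattern REDIRECT = Pattern.compile(
            "^\\s*#REDIRECT\\s*:?\\s*\\[\\[([^\\]|#]+)",
            Pattern.CASE_INSENSITIVE);

    // True if the wikitext looks like a redirect page.
    public static boolean isRedirect(String wikitext) {
        return wikitext != null && REDIRECT.matcher(wikitext).find();
    }

    // Returns the redirect target title, or null if the page is not a redirect.
    public static String redirectTarget(String wikitext) {
        if (wikitext == null) {
            return null;
        }
        Matcher m = REDIRECT.matcher(wikitext);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

A harvester could skip any page for which `isRedirect` returns true, or index the redirect title as an alias of the target article instead of as a document of its own.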

On Mar 20, 2007, at 6:58 AM, Karl Wettin (JIRA) wrote:

> Karl Wettin commented on LUCENE-836:
> ------------------------------------
> Regarding data and user queries, I have a 150 000 document corpus  
> with 4 000 000 queries that I might be able to convince the owners  
> to release. It is great data, but a bit politically incorrect  
> (torrents).
> There is some simple Wikipedia harvesting in LUCENE-826, and I'm in  
> the middle of rewriting it to a more general Wikipedia library for  
> text mining purposes. Perhaps you have some ideas you want to put  
> in there? I plan something like this:
> import java.net.URL;
> import java.util.Date;
> import java.util.Map;
>
> public class WikipediaCorpus {
>   Map<String, String> wikipediaDomainPrefixByLanguageISO;
>   Map<URL, WikipediaArticle> harvestedArticle;
>   public WikipediaArticle getArticle(String languageISO, String title) {
>     // ..
>   }
> }
> public class WikipediaArticle {
>   WikipediaArticle(URL url) {
>     // ..
>   }
>   String languageISO;
>   String title;
>   String[] contentParagraphs;
>   Date[] modified;
>   Map<String, String> articleInOtherLanguagesByLanguageISO;
> }
>> Benchmarks Enhancements (precision/recall, TREC, Wikipedia)
>> -----------------------------------------------------------
>>                 Key: LUCENE-836
>>                 URL:
>>             Project: Lucene - Java
>>          Issue Type: New Feature
>>          Components: Other
>>            Reporter: Grant Ingersoll
>>            Priority: Minor
>> Would be great if the benchmark contrib had a way of providing
>> precision/recall benchmark information à la TREC.  I don't know
>> what the copyright issues are for the TREC queries/data (I think
>> the queries are available, but not sure about the data), so not
>> sure if this is even feasible, but I could imagine we could at
>> least incorporate support for it for those who have access to the
>> data.  It has been a long time since I have participated in TREC,
>> so perhaps someone more familiar w/ the latest can fill in the
>> blanks here.
>> Another option is to ask for volunteers to create queries and make  
>> judgments for the Reuters data, but that is a bit more complex and  
>> probably not necessary.  Even so, an Apache licensed set of  
>> benchmarks may be useful for the community as a whole.  Hmmm....
>> Wikipedia might be another option instead of Reuters to set up as a
>> download for benchmarking, as it is quite large and I believe the  
>> licensing terms are quite amenable.  Having a larger collection  
>> would be good for stressing Lucene more and would give many users  
>> a demonstration of how Lucene handles large collections.
>> At any rate, this kind of information could be useful for people  
>> looking at different indexing schemes, formats, payloads and  
>> different query strategies.

Grant Ingersoll
