[ https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482367 ]
Karl Wettin commented on LUCENE-836:
------------------------------------
Regarding data and user queries, I have a 150 000 document corpus with 4 000 000 queries that
I might be able to convince the owners to release. It is great data, but a bit politically
incorrect (torrents).
There is some simple Wikipedia harvesting in LUCENE-826, and I'm in the middle of rewriting
it to a more general Wikipedia library for text mining purposes. Perhaps you have some ideas
you want to put in there? I plan something like this:
public class WikipediaCorpus {
  Map<String, String> wikipediaDomainPrefixByLanguageISO;
  Map<URL, WikipediaArticle> harvestedArticle;

  public WikipediaArticle getArticle(String languageISO, String title) {
    ..
  }
}

public class WikipediaArticle {
  WikipediaArticle(URL url) {
    ..
  }

  String languageISO;
  String title;
  String[] contentParagraphs;
  Date[] modified;
  Map<String, String> articleInOtherLanguagesByLanguageISO;
}
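
To make the plan above a bit more concrete, here is a minimal in-memory sketch of the
lookup path only. It is not the LUCENE-826 code. The class names, registerLanguage(), the
"https" scheme and the "/wiki/<Title>" layout are all assumptions made for illustration,
and nothing is actually harvested. It keys the cache on java.net.URI rather than URL,
because URL.equals() and URL.hashCode() may resolve host names, which makes URL an
awkward map key.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for WikipediaArticle; holds identity only, no content. */
class WikipediaArticleSketch {
    final URI uri;
    final String languageISO;
    final String title;

    WikipediaArticleSketch(URI uri, String languageISO, String title) {
        this.uri = uri;
        this.languageISO = languageISO;
        this.title = title;
    }
}

/** Hypothetical stand-in for WikipediaCorpus; caches one article object per URI. */
class WikipediaCorpusSketch {
    // Language ISO code -> domain, e.g. "en" -> "en.wikipedia.org" (assumed mapping).
    private final Map<String, String> domainByLanguageISO = new HashMap<>();
    private final Map<URI, WikipediaArticleSketch> harvestedArticles = new HashMap<>();

    void registerLanguage(String languageISO, String domain) {
        domainByLanguageISO.put(languageISO, domain);
    }

    /** Builds the article URI; the "/wiki/<Title>" layout is an assumption. */
    URI articleUri(String languageISO, String title) {
        String domain = domainByLanguageISO.get(languageISO);
        if (domain == null) {
            throw new IllegalArgumentException("Unknown language: " + languageISO);
        }
        try {
            // The multi-argument URI constructor quotes characters that are illegal in a path.
            return new URI("https", domain, "/wiki/" + title.replace(' ', '_'), null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URI for title: " + title, e);
        }
    }

    /** Returns the cached article for this language and title, creating it on first use. */
    WikipediaArticleSketch getArticle(String languageISO, String title) {
        URI uri = articleUri(languageISO, title);
        return harvestedArticles.computeIfAbsent(
                uri, u -> new WikipediaArticleSketch(u, languageISO, title));
    }
}
```

With "en" registered as "en.wikipedia.org", getArticle("en", "Apache Lucene") maps to
https://en.wikipedia.org/wiki/Apache_Lucene, and a second call with the same arguments
returns the same cached object.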
> Benchmarks Enhancements (precision/recall, TREC, Wikipedia)
> -----------------------------------------------------------
>
> Key: LUCENE-836
> URL: https://issues.apache.org/jira/browse/LUCENE-836
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Other
> Reporter: Grant Ingersoll
> Priority: Minor
>
> Would be great if the benchmark contrib had a way of providing precision/recall benchmark
information à la TREC. I don't know what the copyright issues are for the TREC queries/data
(I think the queries are available, but not sure about the data), so not sure if this is even
feasible, but I could imagine we could at least incorporate support for it for those who have
access to the data. It has been a long time since I have participated in TREC, so perhaps
someone more familiar w/ the latest can fill in the blanks here.
> Another option is to ask for volunteers to create queries and make judgments for the
Reuters data, but that is a bit more complex and probably not necessary. Even so, an Apache
licensed set of benchmarks may be useful for the community as a whole. Hmmm....
> Wikipedia might be another option instead of Reuters to set up as a download for benchmarking,
as it is quite large and I believe the licensing terms are quite amenable. Having a larger
collection would be good for stressing Lucene more and would give many users a demonstration
of how Lucene handles large collections.
> At any rate, this kind of information could be useful for people looking at different
indexing schemes, formats, payloads and different query strategies.
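
For reference, the precision/recall numbers asked for in the description can be computed
from a result list and a set of relevance judgments roughly as follows. This is only a
sketch of the standard set-based definitions, not anything from the benchmark contrib. The
class and method names are made up, document ids are plain strings, and the retrieved
list is assumed to contain no duplicates.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper: set-based precision and recall for a single query. */
final class PrecisionRecallSketch {
    private PrecisionRecallSketch() {}

    /** Number of retrieved documents that were judged relevant. */
    static int relevantRetrieved(List<String> retrieved, Set<String> relevant) {
        int hits = 0;
        for (String docId : retrieved) {
            if (relevant.contains(docId)) {
                hits++;
            }
        }
        return hits;
    }

    /** Fraction of retrieved documents that are relevant; 0 when nothing was retrieved. */
    static double precision(List<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0;
        }
        return (double) relevantRetrieved(retrieved, relevant) / retrieved.size();
    }

    /** Fraction of relevant documents that were retrieved; 0 when there are no judgments. */
    static double recall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) relevantRetrieved(retrieved, relevant) / relevant.size();
    }
}
```

For example, if a query returns d1, d2, d3, d4 and the judged-relevant set is {d1, d3, d7},
two of the four results are relevant, so precision is 2/4 and recall is 2/3.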