lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "OpenRelevance" by HossMan
Date Sat, 17 Oct 2009 03:27:57 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The "OpenRelevance" page has been changed by HossMan:
http://wiki.apache.org/lucene-java/OpenRelevance?action=diff&rev1=14&rev2=15

- <<TableOfContents(3)>>
+ This document was the original location for what has now become a Lucene Sub-Project...
  
- = Introduction =
+ http://lucene.apache.org/openrelevance/
  
- The Open Relevance Project is an effort to collect and disseminate free, publicly available
corpora, one or more sets of queries for the corpora and relevance judgments.  We intend to
leverage the power of open source and community to bootstrap this effort.
+ For information on ORP, please join the mailing lists...
  
+ http://lucene.apache.org/openrelevance/mail.html
- = Motivation =
- 
- While [[http://trec.nist.gov|TREC]] and other conferences provide corpora, query sets and
judgments, none of them do so in a free and open way.  Furthermore, while their distribution
rights would allow the use of the corpora by committers at Apache, they do not allow for wider
dissemination to the community as a whole, which severely hinders qualitative improvements
in relevance in open source search projects.  While we are starting this under the auspices
of the Lucene project, we are by no means aiming to be Lucene only.  Eventually, this could
become its own top level project and go beyond just search to offer collections, etc. for
machine learning and other tracks.
- 
- = Planning =
- 
- This project is not even a project yet, so our first step is to gauge interest and then
make it a project.  Assuming that goes forward, however, we see that there are at least three
parts to moving forward, outlined in the subsections below.
- 
- == Corpora ==
- 
- We have started a preliminary crawl of Creative Commons content using Nutch.  This is currently
hosted on a private machine, but we would like to bring this "in house" to the ASF and have
the ASF host both the crawling and the dissemination of the data.  This, obviously, will need
to be supported by the ASF infrastructure, as it is potentially quite burdensome in terms
of disk space and bandwidth.
- 
- The crawled content will be collected with only minimal content filtering (e.g. to remove
unhandled media types); that is, it will also include pages that might be regarded as spam,
junk, spider traps, etc. The proposed methodology is to start from a seed list of Creative
Commons sites (or subsets of sites) and exhaustively crawl all resources within each
site that are linked below the seed root URL, provided such resources are covered by CC licenses
(terminating on nodes that aren't). As a side effect of using Nutch, the project will also provide
an encoded web graph in the form of adjacency lists, both incoming and outgoing.
- 
- == Infrastructure ==
- 
- In addition to the usual developer resources (SVN, mailing lists, site) the project will
need additional machine and bandwidth resources related to the process of collection, editing
and distribution of corpora and relevance judgments.
- 
- We would like to collect at least 50 million pages in the Creative Commons crawl. Assuming that
each page takes ~2kB (compressed), and adding Lucene indexes and temporary space, the project
would need no less than 250 GB of disk space. For Hadoop processing at least two machines
would be needed, although initially a smaller corpus could be processed on a single machine.
In terms of bandwidth, to get the initial data set we likely need to download ~500GB (not
all servers support gzip encoding), including redirects, aliases (the same content reachable
via different urls), etc.
- 
- Editing of relevance judgments can be performed through a web application, so the infrastructure
needs to provide a servlet container. Search functionality will also be provided by a web
application.
- 
- Distribution of the corpus is the most demanding aspect of this project. Due to its size
(~100GB) it's not practical to offer this corpus as a traditional download. ''(use P2P ? create
subsets? distribute on HDD ?)''.  Amazon S3 and EBS (via [[http://aws.amazon.com/publicdatasets/|Amazon
Public Datasets]]) are efficient & cheap options for distributing larger datasets.  Uploading
to a public S3 bucket is the easiest option, and automatically [[http://docs.amazonwebservices.com/AmazonS3/2006-03-01/index.html?S3Torrent.html|makes
uploaded data available via torrent]]. Datasets up to 1 TB [[http://www.datawrangling.com/wikipedia-page-traffic-statistics-dataset|can
also be distributed]] via free public EBS volumes.
- 
- == Queries ==
- 
- We believe we can crowdsource the query effort simply by asking people to generate queries
for the collection via a wiki page that anyone can edit.  While this could result in gaming
in the early stages, we believe over time the query set will stabilize.  Depending on the
user privacy agreement, it might be possible for Wikipedia to make a set of search query referrals
available from server logs (without associated user information).  Any personally identifiable
information in the queries (SSNs, etc.) could be scrubbed, although it would be unlikely these
queries would lead to search clicks on Wikipedia.
- 
- == Relevance Judgments ==
- 
- Unlike TREC, we will focus only on relevance judgments for the top ten or twenty results.
 We will need to figure out a way to "pool" and/or validate the results.  Again, a Wikipedia-style
approach may work here.  '''NOTE: There was a SIGIR poster/paper a little while ago
about crowd-sourcing relevance judgments.  Link to it here.'''
- 
- * ''Is this it?'': [[http://portal.acm.org/citation.cfm?id=1390450|Relevance judgments between
TREC and Non-TREC assessors]] ([[http://ir.shef.ac.uk/cloughie/papers/pp908-almaskari.pdf|Alternate
link]])
- 
- * ''I think it's this one:'' Alonso, O., Rose, D. E., & Stewart, B. (2008). Crowdsourcing
for relevance evaluation. SIGIR Forum, 42(2), 9-15. doi: 10.1145/1480506.1480508: [[http://doi.acm.org/10.1145/1480506.1480508]]
- 
- * ''Look out for ACM SIGIR 2009 paper on relevance judgements as a game.''
  
  
- = Next Steps =
+ (This document has migrated to: http://cwiki.apache.org/ORP/orp-background.html)
  
-  * Attract volunteers
-  * Get infrastructure backing
-  * Academic involvement?  SIGIR?  Others?
- 
- = History =
- 
- This topic has come up a number of times in the past.  Grant Ingersoll has corresponded
with the lead of TREC, Ellen Voorhees, and with the Univ. of Glasgow (keepers of some of the
TREC documents, see http://ir.dcs.gla.ac.uk/test_collections/) about either obtaining TREC
resources (even if the ASF has to pay) or creating a truly open collection.  See http://www.lucidimagination.com/search/document/656d5ca50c8c9242/trec_collection_nist_and_lucene,
specifically http://www.lucidimagination.com/search/document/656d5ca50c8c9242/trec_collection_nist_and_lucene#84e9e24ee9ff4779.
 On the Glasgow side, all conversations there were private, and thus not available.  The gist
of them is that the only possibility for distribution is to a limited number of committers
within the ASF on a project by project basis (even though the ASF is purchasing) and that
the burden is on us to maintain a list of people who have access to the documents.  Ultimately,
while the ASF was still willing to make the purchase, Grant felt it was too much of a burden
and not beneficial to Open Source and has thus not proceeded with obtaining the collection.
- 
- = Initial Committers =
- 
- While we think most of the work can be done via a wiki and that there aren't really going
to be any "releases" per se, we may still have some tools, etc. that we check into SVN.
- 
-  * Grant Ingersoll
-  * Andrzej Bialecki
-  * Simon Willnauer
-  * Otis Gospodnetić
- 
