lucene-java-commits mailing list archives

From Apache Wiki <>
Subject [Lucene-java Wiki] Update of "OpenRelevance" by AndrzejBialecki
Date Mon, 11 May 2009 20:29:33 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by AndrzejBialecki:

  == Corpora ==
  We have started a preliminary crawl of Creative Commons content using Nutch.  This is currently
hosted on a private machine, but we would like to bring this "in house" to the ASF and have
the ASF host both the crawling and the dissemination of the data.  This, obviously, will need
to be supported by the ASF infrastructure, as it is potentially quite burdensome in terms
of disk space and bandwidth.
+ The crawled content will be collected with only minimal content filtering (e.g. to remove
unhandled media types); that is, it will also include pages that may be regarded as spam,
junk, spider traps, etc. The proposed methodology is to start from a seed list of Creative
Commons sites (or subsets of sites) and exhaustively crawl all resources linked below each
seed root URL, as long as those resources are covered by CC licenses (terminating at nodes
that aren't). As a side effect of using Nutch, the project will also provide an encoded
web graph in the form of adjacency lists, both incoming and outgoing.
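The site-restricted, license-terminated expansion described above can be sketched as follows. This is an illustrative Python sketch, not actual Nutch code; the CC-detection heuristic (`CC_MARKER`) and the `fetch_page`/`extract_links` hooks are assumptions standing in for the real fetcher and parser:

```python
from urllib.parse import urlparse

# Hypothetical heuristic for spotting a CC license link in page content;
# a real crawl would use proper license metadata detection.
CC_MARKER = "creativecommons.org/licenses/"

def under_seed_root(url, seed):
    """True if url lives on the same host, at or below the seed's path prefix."""
    u, s = urlparse(url), urlparse(seed)
    return u.netloc == s.netloc and u.path.startswith(s.path)

def crawl(seed, fetch_page, extract_links):
    """Breadth-first crawl from one seed: keep only CC-licensed pages,
    follow only links below the seed root, and stop expanding at
    pages that carry no CC license marker."""
    frontier, seen, kept = [seed], {seed}, []
    while frontier:
        url = frontier.pop(0)
        page = fetch_page(url)
        if CC_MARKER not in page:      # terminate on non-CC nodes
            continue
        kept.append(url)
        for link in extract_links(page):
            if link not in seen and under_seed_root(link, seed):
                seen.add(link)
                frontier.append(link)
    return kept
```

Note that a non-CC page acts as a dead end even if pages behind it are CC-licensed, which matches the "terminate on nodes that aren't" rule above.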
+ == Infrastructure ==
+ In addition to the usual developer resources (SVN, mailing lists, site) the project will
need additional machine and bandwidth resources related to the process of collection, editing
and distribution of corpora and relevance judgments.
+ We would like to collect at least 50 million pages in the Creative Commons crawl. Assuming that
each page takes ~2 kB (compressed), and adding Lucene indexes and temporary space, the project
would need no less than 250 GB of disk space. For Hadoop processing at least two machines
would be needed, although initially a smaller corpus could be processed on a single machine.
In terms of bandwidth, to get the initial data set we would likely need to download ~500 GB (not
all servers support gzip encoding), including redirects, aliases (the same content reachable
via different URLs), etc.
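A quick back-of-envelope check of the figures above (page count and per-page size are the assumptions stated in the text):

```python
# Assumed figures from the proposal text.
pages = 50_000_000                          # target crawl size
kb_per_page = 2                             # ~2 kB per page, compressed

raw_gb = pages * kb_per_page / 1_000_000    # kB -> GB
print(raw_gb)                               # 100.0 GB of compressed page content
# Lucene indexes plus temporary Hadoop space then push the total toward
# the "no less than 250 GB" disk estimate, and uncompressed transfer,
# redirects and aliases toward the ~500 GB bandwidth estimate.
```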
+ Editing of relevance judgments can be performed through a web application, so the infrastructure
needs to provide a servlet container. Search functionality will also be provided by a web
application.
+ Distribution of the corpus is the most demanding aspect of this project. Due to its size
(~100 GB) it's not practical to offer this corpus as a traditional download. ''(Use P2P? Create
subsets? Distribute on HDD?)''
  == Queries ==
