hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hbase/PoweredBy" by stack
Date Mon, 02 Jul 2012 20:36:11 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hbase/PoweredBy" page has been changed by stack:

  [[http://www.openplaces.org|Openplaces]] is a search engine for travel that uses HBase to
store terabytes of web pages and travel-related entity records (countries, cities, hotels,
etc.). We have dozens of MapReduce jobs that crunch data on a daily basis.  We use a 20-node
cluster for development, a 40-node cluster for offline production processing and an EC2 cluster
for the live web site.
- [[http://www.powerset.com/|Powerset (a Microsoft company)]] uses HBase to store raw documents.
 We have a ~110-node Hadoop cluster running DFS, MapReduce, and HBase.  In our Wikipedia HBase
table, we have one row for each Wikipedia page (~2.5M pages and climbing).  We use this as
input to our indexing jobs, which are run in Hadoop MapReduce.  Uploading the entire Wikipedia
dump to our cluster takes a couple of hours.  Scanning the table inside MapReduce is very fast
-- the latency is in the noise compared to everything else we do.
+ [[http://www.pnl.gov|Pacific Northwest National Laboratory]] - Hadoop and HBase (Cloudera
distribution) are being used within PNNL's Computational Biology & Bioinformatics Group
for a systems biology data warehouse project that integrates high throughput proteomics and
transcriptomics data sets coming from instruments in the Environmental Molecular Sciences
Laboratory, a US Department of Energy national user facility located at PNNL. The data sets
are being merged and annotated with other public genomics information in the data warehouse
environment, with Hadoop analysis programs operating on the annotated data in the HBase tables.
This work is hosted by olympus, a large PNNL institutional computing cluster (http://www.pnl.gov/news/release.aspx?id=908),
with the HBase tables being stored in olympus's Lustre file system.
  [[http://www.readpath.com/|ReadPath]] uses HBase to store several hundred million RSS items
and a dictionary for its RSS newsreader. ReadPath is currently running on an 8-node cluster.
