hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hbase/PoweredBy" by CosminLehene
Date Thu, 26 Feb 2009 18:41:13 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by CosminLehene:

The comment on the change is:
Adding information about HBase usage in Adobe

+ [http://www.adobe.com Adobe] - We use a 5-node cluster running HDFS, Hadoop and HBase as
a storage and processing backend for some of our social services. Data is regularly aggregated
using MapReduce jobs and stored back in HBase. Currently an evaluation experiment, the storage
is designed to store around 20-40M rows of structured data. The production cluster has been
running since Oct 2008.
  [http://www.mahalo.com Mahalo], "...the world's first human-powered search engine". All
the markup that powers the wiki is stored in HBase. It's been in use for a few months now.
!MediaWiki - the same software that powers Wikipedia - has version/revision control. Mahalo's
in-house editors produce a lot of revisions per day, which was not working well in an RDBMS.
An HBase-based solution for this was built and tested, and the data migrated out of MySQL
and into HBase. Right now it's at something like 6 million items in HBase. The upload tool
runs every hour from a shell script to back up that data, and on 6 nodes takes about 5-10
minutes to run - and does not slow down production at all. 
  [http://www.powerset.com/ Powerset (a Microsoft company)] uses HBase to store raw documents.
We have a ~70 node Hadoop cluster running DFS, MapReduce, and HBase. In our Wikipedia HBase
table, we have one row for each Wikipedia page (~2.5M pages and climbing). We use this as
input to our indexing jobs, which are run in Hadoop MapReduce. Uploading the entire Wikipedia
dump to our cluster takes a couple of hours. Scanning the table inside MapReduce is very fast
-- the latency is in the noise compared to everything else we do.
