hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hbase/PoweredBy" by DaveLatham
Date Fri, 26 Jun 2009 17:49:40 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by DaveLatham:
http://wiki.apache.org/hadoop/Hbase/PoweredBy

The comment on the change is:
added Flurry, moved OpenPlaces to alphabetical order

------------------------------------------------------------------------------
  [http://www.adobe.com Adobe] - We currently have about 30 nodes running HDFS, Hadoop, and
HBase in clusters ranging from 5 to 14 nodes, in both production and development. We plan
a deployment on an 80-node cluster. We are using HBase in several areas, from social services
to structured data and processing for internal use. We constantly write data to HBase, run
MapReduce jobs to process it, and store the results back in HBase or in external systems. Our
production cluster has been running since October 2008.
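
(A minimal sketch of the write-then-process round trip described above, using HBase's stock
MapReduce helpers.  This is not Adobe's actual code; the table, family, and column names are
invented placeholders.)
{{{
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class ProcessAndStoreBack {

  // Reads each row of the (hypothetical) "raw_events" table, derives a value,
  // and emits a Put destined for the (hypothetical) "processed_events" table.
  static class ProcessMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      byte[] raw = value.getValue(Bytes.toBytes("data"), Bytes.toBytes("raw"));
      if (raw == null) return;
      Put put = new Put(row.get());
      // Placeholder processing step: store the size of the raw cell.
      put.add(Bytes.toBytes("data"), Bytes.toBytes("size"),
              Bytes.toBytes(Integer.toString(raw.length)));
      ctx.write(row, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "process-and-store-back");
    job.setJarByClass(ProcessAndStoreBack.class);
    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in batches during the scan
    scan.setCacheBlocks(false);  // keep MR scans out of the block cache
    TableMapReduceUtil.initTableMapperJob("raw_events", scan,
        ProcessMapper.class, ImmutableBytesWritable.class, Put.class, job);
    // Map-only job: the mapper's Puts go straight to the output table.
    TableMapReduceUtil.initTableReducerJob("processed_events", null, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
}}}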
+ 
+ [http://www.flurry.com Flurry] provides mobile application analytics.  We use HBase and
Hadoop for all of our analytics processing, and serve all of our live requests directly out
of HBase in our production cluster, which holds billions of rows across several tables.
  
  [http://www.mahalo.com Mahalo], "...the world's first human-powered search engine". All
the markup that powers the wiki is stored in HBase. It's been in use for a few months now.
!MediaWiki - the same software that powers Wikipedia - has version/revision control. Mahalo's
in-house editors produce a lot of revisions per day, which was not working well in an RDBMS.
An HBase-based solution for this was built and tested, and the data was migrated out of MySQL
and into HBase. Right now it's at something like 6 million items in HBase. The upload tool
runs every hour from a shell script to back up that data, and on 6 nodes takes about 5-10
minutes to run - and does not slow down production at all.
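
(A minimal sketch of keeping many revisions of a page in HBase via built-in cell versioning.
This is not Mahalo's implementation; the table, family, and qualifier names are assumptions,
as is the use of cell versions rather than, say, one row per revision.)
{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class WikiRevisions {
  // Assumes a "pages" table whose "content" family was created with a large
  // VERSIONS limit, so HBase retains many timestamped copies of each cell.
  static final byte[] FAMILY = Bytes.toBytes("content");
  static final byte[] QUALIFIER = Bytes.toBytes("markup");

  // Store a new revision; older versions of the cell stay readable.
  static void saveRevision(HTable pages, String title, String markup)
      throws Exception {
    Put put = new Put(Bytes.toBytes(title));
    put.add(FAMILY, QUALIFIER, Bytes.toBytes(markup));
    pages.put(put);
  }

  // Fetch up to n of the most recent revisions of a page in one round trip.
  static Result recentRevisions(HTable pages, String title, int n)
      throws Exception {
    Get get = new Get(Bytes.toBytes(title));
    get.setMaxVersions(n);
    return pages.get(get);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable pages = new HTable(conf, "pages");
    saveRevision(pages, args[0], args[1]);
    pages.close();
  }
}
}}}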
  
+ [http://www.openplaces.org Openplaces] is a search engine for travel that uses HBase to
store terabytes of web pages and travel-related entity records (countries, cities, hotels,
etc.). We have dozens of MapReduce jobs that crunch data on a daily basis.  We use a 20-node
cluster for development, a 40-node cluster for offline production processing and an EC2 cluster
for the live web site. 
  [http://www.powerset.com/ Powerset (a Microsoft company)] uses HBase to store raw documents.
 We have a ~110-node Hadoop cluster running DFS, MapReduce, and HBase.  In our Wikipedia HBase
table, we have one row for each Wikipedia page (~2.5M pages and climbing).  We use this as
input to our indexing jobs, which are run in Hadoop MapReduce.  Uploading the entire Wikipedia
dump to our cluster takes a couple of hours.  Scanning the table inside MapReduce is very fast
-- the latency is in the noise compared to everything else we do.
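
(A minimal sketch of using a full-table scan as MapReduce input, in the spirit of the indexing
jobs described above.  This is not Powerset's code; the table and column names, and the toy
tokenizer, are invented for illustration.)
{{{
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexPages {
  // Emits (term, pageId) pairs from each stored page; real tokenization and
  // posting-list construction are elided.
  static class IndexMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      byte[] doc = value.getValue(Bytes.toBytes("page"), Bytes.toBytes("text"));
      if (doc == null) return;
      Text pageId = new Text(row.get());
      for (String term : Bytes.toString(doc).split("\\s+")) {
        ctx.write(new Text(term), pageId);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "index-wikipedia-pages");
    job.setJarByClass(IndexPages.class);
    Scan scan = new Scan();
    scan.setCaching(500);        // batch rows per RPC for a faster full scan
    scan.setCacheBlocks(false);  // don't flush the region server block cache
    TableMapReduceUtil.initTableMapperJob("wikipedia", scan,
        IndexMapper.class, Text.class, Text.class, job);
    job.setOutputKeyClass(Text.class);   // identity reducer passes pairs through
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
}}}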
  
  [http://www.streamy.com/ Streamy] is a recently launched realtime social news site.  We
use HBase for all of our data storage, query, and analysis needs, replacing an existing SQL-based
system.  This includes hundreds of millions of documents, sparse matrices, logs, and everything
else once done in the relational system.  We perform significant in-memory caching of query
results, similar to a traditional Memcached/SQL setup, and use other external components
to perform joining and sorting.  We also run thousands of daily MapReduce jobs using HBase
tables for log analysis, attention data processing, and feed crawling.  HBase has helped us
scale and distribute in ways we could not otherwise, and the community has provided consistent
and invaluable assistance.
@@ -22, +25 @@

  
  [http://www.yahoo.com/ Yahoo!] uses HBase to store document fingerprints for detecting near-duplicates.
We have a cluster of a few nodes that runs HDFS, MapReduce, and HBase. The table contains millions
of rows. We use this to query for duplicate documents against realtime traffic.
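
(A minimal sketch of a fingerprint lookup at request time.  This is not Yahoo!'s code; the
table name and the fingerprint-as-row-key layout are assumptions.)
{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class FingerprintLookup {
  // Probes a (hypothetical) "fingerprints" table keyed directly by the
  // document fingerprint; a non-empty Result means a near-duplicate exists.
  // A real server would reuse one HTable per thread instead of opening one
  // per request.
  public static boolean isNearDuplicate(Configuration conf, byte[] fingerprint)
      throws Exception {
    HTable table = new HTable(conf, "fingerprints");
    try {
      Result result = table.get(new Get(fingerprint));
      return !result.isEmpty();
    } finally {
      table.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    System.out.println(isNearDuplicate(conf, args[0].getBytes()));
  }
}
}}}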
  
- [http://www.openplaces.org Openplaces] is a search engine for travel that uses HBase to
store terabytes of web pages and travel-related entity records (countries, cities, hotels,
etc.). We have dozens of MapReduce jobs that crunch data on a daily basis.  We use a 20-node
cluster for development, a 40-node cluster for offline production processing and an EC2 cluster
for the live web site. 
- 
