hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hbase/PoweredBy" by RyanRawson
Date Thu, 10 Sep 2009 23:26:45 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by RyanRawson:

  [http://www.streamy.com/ Streamy] is a recently launched realtime social news site.  We
use HBase for all of our data storage, query, and analysis needs, replacing an existing SQL-based
system.  This includes hundreds of millions of documents, sparse matrices, logs, and everything
else once done in the relational system.  We perform significant in-memory caching of query
results similar to a traditional Memcached/SQL setup as well as other external components
to perform joining and sorting.  We also run thousands of daily MapReduce jobs using HBase
tables for log analysis, attention data processing, and feed crawling.  HBase has helped us
scale and distribute in ways we could not otherwise, and the community has provided consistent
and invaluable assistance.
+ [http://www.stumbleupon.com/ Stumbleupon] and [http://su.pr Su.pr] use HBase as a real-time
data storage and analytics platform. Serving directly out of HBase, various site features
and statistics are kept up to date in real time. We also use HBase as a MapReduce
data source to overcome traditional query speed limits in MySQL.
  [http://www.subrecord.org SubRecord Project] is an Open Source project that uses HBase
as a repository of records (persisted map-like data) for the aspects it provides, such as logging,
tracing, or metrics. Together, HBase and a Lucene index constitute the repo/storage for this platform.
  [http://www.tokenizer.org Shopping Engine at Tokenizer] is a web crawler; it uses HBase
to store URLs and Outlinks (!AnchorText + LinkedURL): more than a billion. It was initially
designed as a Nutch-Hadoop extension, then (due to a very specific 'shopping' scenario) moved
to SOLR + MySQL(InnoDB) (tens of thousands of queries per second), and now to HBase. HBase is
significantly faster due to: no need for huge transaction logs, a column-oriented design that
exactly matches 'lazy' business logic, data compression, and !MapReduce support. The number of
mutable 'indexes' (a term from RDBMS) is significantly reduced because each 'row::column'
structure is physically sorted by 'row'. The MySQL InnoDB engine is the best DB choice for
highly concurrent updates. However, the need to flush a block of data to the hard drive even
if only a few bytes changed is an obvious bottleneck. HBase greatly helps: the 'delete-insert',
'mutable primary key', and 'natural primary key' patterns, which are not so popular in modern
DBMSs, become a big advantage with HBase.
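
The point about rows being physically sorted by 'row' can be illustrated with a small sketch. This is a toy in-memory model in plain Python, not the HBase API: the class `SortedRowStore`, its methods, and the example row keys are all hypothetical names chosen for illustration. It assumes only what the entry states, namely that rows are kept ordered by a natural row key, so a range of related rows can be read without a separate index, and that changing a key can be done as a delete followed by an insert.

```python
import bisect


class SortedRowStore:
    """Toy model of a table whose rows are kept sorted by row key.

    This is an illustration only; it does not use or imitate the HBase client API.
    """

    def __init__(self):
        self._keys = []  # row keys, always kept in sorted order
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, column, value):
        """Insert or update one column of a row, keeping keys sorted."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def delete(self, row_key):
        """Remove a whole row if it exists."""
        if row_key in self._rows:
            del self._rows[row_key]
            del self._keys[bisect.bisect_left(self._keys, row_key)]

    def rekey(self, old_key, new_key):
        """'Mutable primary key' expressed as delete-insert."""
        columns = self._rows.get(old_key)
        if columns is None:
            return False
        self.delete(old_key)
        for column, value in columns.items():
            self.put(new_key, column, value)
        return True

    def scan_prefix(self, prefix):
        """Return all rows whose key starts with prefix, in key order.

        Because keys are sorted, matching rows are contiguous, so no
        secondary index is needed to find them.
        """
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append((key, self._rows[key]))
        return result
```

With natural keys such as a hypothetical `"com.example/shoes"`, calling `scan_prefix("com.example/")` returns every stored row for that site in key order, which is the kind of lookup that would otherwise need an extra index in a relational design.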
