hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hbase/PoweredBy" by GeorgeStathis
Date Fri, 08 Jul 2011 19:40:29 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hbase/PoweredBy" page has been changed by GeorgeStathis:
http://wiki.apache.org/hadoop/Hbase/PoweredBy?action=diff&rev1=71&rev2=72

Comment:
Added Traackr

  
  [[http://www.tokenizer.org|Shopping Engine at Tokenizer]] is a web crawler; it uses HBase
to store more than a billion URLs and outlinks (!AnchorText + LinkedURL). It was initially
designed as a Nutch-Hadoop extension, then (due to a very specific 'shopping' scenario) moved
to SOLR + MySQL (InnoDB) (tens of thousands of queries per second), and now to HBase. HBase is
significantly faster because it needs no huge transaction logs, its column-oriented design exactly
matches the 'lazy' business logic, and it offers data compression and !MapReduce support. The number
of mutable 'indexes' (a term from RDBMS) is significantly reduced because each 'row::column' structure
is physically sorted by 'row'. The MySQL InnoDB engine is the best DB choice for highly concurrent
updates. However, the need to flush a block of data to the hard drive even when only a few bytes
have changed is an obvious bottleneck. HBase greatly helps: the 'delete-insert', 'mutable primary
key', and 'natural primary key' patterns, which are not so popular in modern DBMSs, become a big
advantage with HBase.
  
+ [[http://traackr.com/|Traackr]] uses HBase to store and serve online influencer data in
real time. We use MapReduce to frequently re-score our entire data set as we keep updating
influencer metrics on a daily basis.
+ 
  [[http://trendmicro.com/|Trend Micro]] uses HBase as a foundation for cloud-scale storage
for a variety of applications. We have been developing with HBase since version 0.1 and running
it in production since version 0.20.0.
  
  [[http://www.twitter.com|Twitter]] runs HBase across its entire Hadoop cluster. HBase provides
a distributed, read/write backup of all MySQL tables in Twitter's production backend, allowing
engineers to run MapReduce jobs over the data while maintaining the ability to apply periodic
row updates (something that is more difficult to do with vanilla HDFS). A number of applications,
including people search, rely on HBase internally for data generation. Additionally, the operations
team uses HBase as a time-series database for cluster-wide monitoring and performance data.
