hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hbase/PoweredBy" by BryanMcCormick
Date Thu, 14 Jan 2010 02:56:08 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hbase/PoweredBy" page has been changed by BryanMcCormick.
http://wiki.apache.org/hadoop/Hbase/PoweredBy?action=diff&rev1=40&rev2=41

--------------------------------------------------

  
  [[http://www.powerset.com/|Powerset (a Microsoft company)]] uses HBase to store raw documents.
 We have a ~110 node hadoop cluster running DFS, mapreduce, and hbase.  In our wikipedia hbase
table, we have one row for each wikipedia page (~2.5M pages and climbing).  We use this as
input to our indexing jobs, which are run in hadoop mapreduce.  Uploading the entire wikipedia
dump to our cluster takes a couple hours.  Scanning the table inside mapreduce is very fast
-- the latency is in the noise compared to everything else we do.
  
+ [[http://www.readpath.com/|ReadPath]] uses HBase to store several hundred million RSS items
and a dictionary for its RSS newsreader. ReadPath is currently running on an 8-node cluster.

+ 
  [[http://www.runa.com/|Runa Inc.]] offers a SaaS that enables online merchants to offer
dynamic per-consumer, per-product promotions embedded in their website. To implement this,
we collect the click streams of all their visitors and, together with the merchant's rules,
determine what promotion to offer each visitor at different points while they browse the merchant's
website. So we have lots of data and have to do lots of off-line and real-time analytics.
HBase is the core for us. We also use Clojure and our own open sourced distributed processing
framework, Swarmiji. The HBase Community has been key to our forward movement with HBase.
We're looking for experienced developers to join us to help make things go even faster!
  
  [[http://www.socialmedia.com/|SocialMedia]] uses HBase to store and process user events,
which allows us to provide near-realtime user metrics and reporting. HBase forms the heart
of our Advertising Network data storage and management system. We use HBase as a data source
and sink for realtime request cycle queries and as a backend for mapreduce analysis.
@@ -32, +34 @@

  
  [[http://www.tokenizer.org|Shopping Engine at Tokenizer]] is a web crawler; it uses HBase
to store URLs and Outlinks (!AnchorText + LinkedURL): more than a billion. It was initially
designed as a Nutch-Hadoop extension, then (due to a very specific 'shopping' scenario) moved
to SOLR + MySQL(InnoDB) (tens of thousands of queries per second), and now - to HBase. HBase is significantly
faster due to: no need for huge transaction logs, a column-oriented design that exactly matches 'lazy'
business logic, data compression, and !MapReduce support. The number of mutable 'indexes' (a term from
RDBMS) is significantly reduced because each 'row::column' structure is physically
sorted by 'row'. The MySQL InnoDB engine is the best DB choice for highly-concurrent updates. However,
the need to flush a block of data to the hard drive even if we changed only a few bytes is an obvious
bottleneck. HBase greatly helps: the 'delete-insert', 'mutable primary key', and 'natural primary
key' patterns, not so popular in modern DBMS, become a big advantage with HBase.
  
- [[http://trendmicro.com/|Trend Micro]] uses HBase as a foundation for cloud scale storage
for a variety of applications. We have been developing with HBase since version 0.1 and production
since version 0.20.0. 
+ [[http://trendmicro.com/|Trend Micro]] uses HBase as a foundation for cloud scale storage
for a variety of applications. We have been developing with HBase since version 0.1 and production
since version 0.20.0.
  
  [[http://www.veoh.com/|Veoh Networks]] uses HBase to store and process visitor (human) and
entity (non-human) profiles, which are used for behavioral targeting, demographic detection,
and personalization services.  Our site reads this data in real-time (heavily cached) and
submits updates via various batch map/reduce jobs. With 25 million unique visitors a month,
storing this data in a traditional RDBMS is not an option. We currently have a 24-node Hadoop/HBase
cluster and our profiling system is sharing this cluster with our other Hadoop data pipeline
processes.
  
