hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "Hbase/PoweredBy" by JimKellerman
Date Fri, 03 Oct 2008 23:17:05 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by JimKellerman:
http://wiki.apache.org/hadoop/Hbase/PoweredBy

The comment on the change is:
Escape some words that look like Wiki words but are not.

------------------------------------------------------------------------------
- [http://www.mahalo.com Mahalo], "...the world's first human-powered search engine". All
the markup that powers the wiki is stored in HBase. It's been in use for a few months now.
MediaWiki - the same software that powers Wikipedia - has version/revision control. Mahalo's
in-house editors produce a lot of revisions per day, which was not working well in an RDBMS.
An HBase-based solution for this was built and tested, and the data was migrated out of MySQL
and into HBase. Right now it's at something like 6 million items in HBase. The upload tool
runs every hour from a shell script to back up that data, and on 6 nodes takes about 5-10
minutes to run - and does not slow down production at all.
+ [http://www.mahalo.com Mahalo], "...the world's first human-powered search engine". All
the markup that powers the wiki is stored in HBase. It's been in use for a few months now.
!MediaWiki - the same software that powers Wikipedia - has version/revision control. Mahalo's
in-house editors produce a lot of revisions per day, which was not working well in an RDBMS.
An HBase-based solution for this was built and tested, and the data was migrated out of MySQL
and into HBase. Right now it's at something like 6 million items in HBase. The upload tool
runs every hour from a shell script to back up that data, and on 6 nodes takes about 5-10
minutes to run - and does not slow down production at all.
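
A minimal sketch of the kind of write this implies, using the current HBase Java client (the
0.18-era API Mahalo would have used was built around HTable and BatchUpdate and looks
different); the table, column family, and qualifier names here are illustrative only. The row
key is the page title, and each edit lands as a new timestamped version of the same cell,
which is how HBase keeps revision history.

{{{
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreWikiRevision {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table wiki = conn.getTable(TableName.valueOf("wiki_markup"))) { // illustrative table
            // Row key = page title; every edit becomes a new timestamped version
            // of the content:markup cell, so prior revisions remain readable.
            Put revision = new Put(Bytes.toBytes("Some_Page_Title"));
            revision.addColumn(Bytes.toBytes("content"),   // illustrative column family
                               Bytes.toBytes("markup"),    // illustrative qualifier
                               Bytes.toBytes("== Latest wiki markup for the page =="));
            wiki.put(revision);
        }
    }
}
}}}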
  
  [http://www.powerset.com/ Powerset (a Microsoft company)] uses HBase to store raw documents.
 We have a ~70 node Hadoop cluster running DFS, MapReduce, and HBase. In our Wikipedia HBase
table, we have one row for each Wikipedia page (~2.5M pages and climbing). We use this as
input to our indexing jobs, which are run in Hadoop MapReduce. Uploading the entire Wikipedia
dump to our cluster takes a couple of hours. Scanning the table inside MapReduce is very fast
-- the latency is in the noise compared to everything else we do.
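
For concreteness, here is a sketch of a scan-driven, map-only job over such a table, written
against the current org.apache.hadoop.hbase.mapreduce API (HBase 0.18 exposed older
TableMap-style classes instead); the table name "wikipedia" follows the description above,
but the mapper logic and output types are illustrative only.

{{{
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanWikipediaTable {
    // One map() call per HBase row, i.e. per Wikipedia page.
    static class PageMapper extends TableMapper<Text, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result page, Context ctx)
                throws IOException, InterruptedException {
            // Emit the page id (row key) and the number of cells it carries.
            ctx.write(new Text(rowKey.get()), new LongWritable(page.size()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-wikipedia");
        job.setJarByClass(ScanWikipediaTable.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fetch rows in larger batches for full-table scans
        scan.setCacheBlocks(false);  // keep the region server block cache clean

        TableMapReduceUtil.initTableMapperJob(
                "wikipedia", scan, PageMapper.class,
                Text.class, LongWritable.class, job);
        job.setNumReduceTasks(0);                          // map-only sketch
        job.setOutputFormatClass(NullOutputFormat.class);  // discard output for brevity
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
}}}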
  
- [http://www.tokenizer.org Shopping Engine at Tokenizer] is a web crawler; it uses HBase
to store URLs and Outlinks (AnchorText + LinkedURL): more than a billion of them. It was
initially designed as a Nutch-Hadoop extension, then (due to a very specific 'shopping'
scenario) moved to SOLR + MySQL (InnoDB) (tens of thousands of queries per second), and now
- to HBase. HBase is significantly faster due to: no need for huge transaction logs, a
column-oriented design that exactly matches the 'lazy' business logic, data compression, and
MapReduce support. The number of mutable 'indexes' (an RDBMS term) is significantly reduced
because each 'row::column' structure is physically sorted by 'row'. MySQL's InnoDB engine is
the best DB choice for highly concurrent updates; however, the need to flush a block of data
to the hard drive even when only a few bytes have changed is an obvious bottleneck. HBase
helps greatly here: the 'delete-insert', 'mutable primary key', and 'natural primary key'
patterns - not so popular in modern DBMSs - become a big advantage with HBase.
+ [http://www.tokenizer.org Shopping Engine at Tokenizer] is a web crawler; it uses HBase
to store URLs and Outlinks (!AnchorText + LinkedURL): more than a billion of them. It was
initially designed as a Nutch-Hadoop extension, then (due to a very specific 'shopping'
scenario) moved to SOLR + MySQL (InnoDB) (tens of thousands of queries per second), and now
- to HBase. HBase is significantly faster due to: no need for huge transaction logs, a
column-oriented design that exactly matches the 'lazy' business logic, data compression, and
!MapReduce support. The number of mutable 'indexes' (an RDBMS term) is significantly reduced
because each 'row::column' structure is physically sorted by 'row'. MySQL's InnoDB engine is
the best DB choice for highly concurrent updates; however, the need to flush a block of data
to the hard drive even when only a few bytes have changed is an obvious bottleneck. HBase
helps greatly here: the 'delete-insert', 'mutable primary key', and 'natural primary key'
patterns - not so popular in modern DBMSs - become a big advantage with HBase.
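
A small sketch of what the 'natural primary key' and 'delete-insert' patterns mentioned above
can look like with the current HBase Java client; the table name, column family, and helper
method are hypothetical, and the real Tokenizer schema is not described in this entry.

{{{
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReplaceOutlinks {
    // Replaces the stored outlinks for a crawled URL. The URL itself is the
    // natural row key, so "updating" a row is simply a delete followed by a
    // fresh insert - the delete-insert pattern described above.
    static void replace(Table links, String url, String anchorText, String linkedUrl)
            throws Exception {
        byte[] row = Bytes.toBytes(url);
        links.delete(new Delete(row));            // drop the old row ...
        Put fresh = new Put(row);                 // ... and write it back with current data
        fresh.addColumn(Bytes.toBytes("out"),     // hypothetical column family
                        Bytes.toBytes(linkedUrl), // one qualifier per outgoing link
                        Bytes.toBytes(anchorText));
        links.put(fresh);
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table links = conn.getTable(TableName.valueOf("outlinks"))) { // hypothetical table
            replace(links, "http://example.com/p1", "example anchor", "http://example.com/p2");
        }
    }
}
}}}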
  
- [http://trendmicro.com/ Trend Micro] Advanced Threats Research is running Hadoop 0.18.1
and HBase 0.18.0. Our application is a web crawler with concurrent batch content analysis
of various kinds. All of the workflow components are implemented as subclasses of TableMap
and/or TableReduce on a cluster of 25 nodes. We see a constant rate of 2,500 requests/sec or
greater, peaking periodically near 100K/sec when some of the batch scan tasks run.
+ [http://trendmicro.com/ Trend Micro] Advanced Threats Research is running Hadoop 0.18.1
and HBase 0.18.0. Our application is a web crawler with concurrent batch content analysis
of various kinds. All of the workflow components are implemented as subclasses of !TableMap
and/or !TableReduce on a cluster of 25 nodes. We see a constant rate of 2,500 requests/sec or
greater, peaking periodically near 100K/sec when some of the batch scan tasks run.
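
As a rough illustration of the write-back half of such a workflow, here is a reducer that sums
per-URL counts and stores the result as an HBase row. It is written against the current
TableReducer/TableMapReduceUtil classes rather than the TableReduce class shipped with HBase
0.18, and the table, family, and qualifier names are assumed for the example.

{{{
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class AnalysisReducer extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
    // Sums the per-URL counts emitted by the map phase and writes one result
    // row back into an HBase results table.
    @Override
    protected void reduce(Text url, Iterable<LongWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable c : counts) {
            total += c.get();
        }
        Put out = new Put(Bytes.toBytes(url.toString()));
        out.addColumn(Bytes.toBytes("stats"),   // assumed column family
                      Bytes.toBytes("hits"),    // assumed qualifier
                      Bytes.toBytes(total));
        ctx.write(null, out);                   // the row key is taken from the Put itself
    }

    // Wiring the reducer into a job; "analysis_results" is an assumed table name.
    static void configure(Job job) throws IOException {
        TableMapReduceUtil.initTableReducerJob("analysis_results", AnalysisReducer.class, job);
    }
}
}}}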
  
  [http://www.videosurf.com/ VideoSurf] - "The video search engine that has taught computers
to see". We're using HBase to persist various large graphs of data and other statistics. HBase
was a real win for us because it let us store substantially larger datasets without manually
partitioning the data, and its column-oriented nature allowed us to create schemas that were
substantially more efficient for storing and retrieving data.
  
