hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: more noob questions--how/when is data 'distributed' across a cluster?
Date Fri, 04 Apr 2008 20:16:20 GMT

On 4/4/08 11:48 AM, "Paul Danese" <thebusyant@gmail.com> wrote:
> [ ... Extract and report on 25,000 out of 10^6 records ...]
> So...at my naive level, this seems like a decent job for hadoop.
> ***QUESTION 1: Is this an accurate belief?***

Sounds just right.

On 10 low-end machines, it is reasonable to expect to scan about 100MB per
second in aggregate, so if your records (with associated data) total about
1GB, it should take about 10 seconds to pass over your data.  With allowances
for system reality you should be able to do something interesting in a minute
or so.
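
As a rough check on that arithmetic, here is a minimal sketch. The function
name and the overhead factor of 6 are illustrative assumptions, not
measurements; only the 1GB and 100MB/s figures come from the estimate above.

```python
GB = 10**9
MB = 10**6

def scan_seconds(data_bytes, cluster_bytes_per_sec):
    """Time for one full pass over the data at a given aggregate scan rate."""
    return data_bytes / cluster_bytes_per_sec

# Figures from the estimate: about 1GB of data, about 100MB/s across the cluster.
raw_seconds = scan_seconds(1 * GB, 100 * MB)

# Assumed fudge factor for job startup, scheduling and stragglers.
overhead_factor = 6
realistic_seconds = raw_seconds * overhead_factor

print(raw_seconds, realistic_seconds)
```

Plugging in your own data size and measured scan rate gives a quick sanity
check before you build anything.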

> [ ... Map does projection and object tagging, reduce does reporting ...]

Your process outline looks fine.
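
To make the shape of such a job concrete, here is a small in-memory sketch in
plain Python. It is not the Hadoop API, and the record fields (id, category,
value) and the wanted_ids set are hypothetical stand-ins, since the details of
the original outline are elided above.

```python
from collections import defaultdict

def map_record(record, wanted_ids):
    """Projection and tagging: keep wanted records, emit (tag, needed fields)."""
    if record["id"] in wanted_ids:
        yield record["category"], (record["id"], record["value"])

def reduce_group(tag, values):
    """Reporting: summarise all projected records that share a tag."""
    values = list(values)
    return {"tag": tag, "count": len(values), "total": sum(v for _, v in values)}

def run_job(records, wanted_ids):
    """Simulate map, shuffle (group by key) and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record, wanted_ids):
            groups[key].append(value)
    return [reduce_group(key, groups[key]) for key in sorted(groups)]

sample = [
    {"id": 1, "category": "a", "value": 2.0},
    {"id": 2, "category": "b", "value": 5.0},
    {"id": 3, "category": "a", "value": 3.0},
    {"id": 4, "category": "a", "value": 7.0},
]
report = run_job(sample, wanted_ids={1, 2, 3})
```

In a real job the framework does the grouping between the two functions; the
map and reduce logic stays about this small.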

> ***QUESTION 2:  If the answer to Q1 is "yes", how does one typically "move"
> data from a rdbms to something like HDFS/HBase?***

There are copy commands.  Typically, you partition your data by date of
insertion and dump new records into new files.  Occasionally, you might
merge older records to limit the number of total files.
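
A minimal sketch of that dump step, assuming the rows have already been
fetched from the database as dictionaries. The field names (id, inserted_on,
payload), the file naming scheme and the tab-separated layout are all
hypothetical choices for illustration.

```python
import csv
import os
from collections import defaultdict

def dump_by_insert_date(rows, out_dir):
    """Write rows to one tab-separated file per insertion date; return the paths."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["inserted_on"]].append(row)

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for day in sorted(by_date):
        path = os.path.join(out_dir, f"records-{day}.tsv")
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t")
            for row in by_date[day]:
                writer.writerow([row["id"], row["inserted_on"], row["payload"]])
        paths.append(path)
    return paths
```

Each resulting file can then be copied into HDFS with the file system shell,
for example with hadoop fs -put.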

I don't see any need for HBase here, but since the two things that HBase
does really well are insertion and table scan, it would be pretty natural to
use.  You should check with the HBase guys to see how long it will take to
produce the output you want, but I would expect that you could get your
25000 rows faster than a raw scan using Hadoop alone.

> ***QUESTION 3:  Am I right in thinking that my HBase data are going to be
> denormalized relative to my RDBMS?***

Very likely.

> ***QUESTION 5:  Or is this whole problem something better addressed by some
> type of high-performance rdbms cluster?***

This would be a very modestly sized Oracle data set, but I would guess that
the costs would make Hadoop preferable.  It isn't even all that large for
MySQL.  The existence of very nice reporting software for either of these
could tip the balance the other way, however.

> ***QUESTION 6:  Is there a detailed (step by step) tutorial on how to use
> HBase w/ Hadoop?***

I think you could start with Hadoop alone.
