hadoop-common-user mailing list archives

From "Paul Danese" <thebusy...@gmail.com>
Subject more noob questions--how/when is data 'distributed' across a cluster?
Date Fri, 04 Apr 2008 18:48:44 GMT

Currently I have a large (for me) amount of data stored in a relational
database: 3 tables, each with 2 - 10 million related records. (This is an
oversimplification, but for clarity it's close enough.)

There is a relatively simple Object-relational Mapping (ORM) to my
database:  Specifically, my parent Object is called "Accident".
Accidents can have 1 or more Report objects (has many).
Reports can have 1 or more Outcome objects (has many).

Each of these Objects maps to a specific table in my RDBMS w/ foreign keys
'connecting' records between tables.

I run searches against this database (using Lucene), and this works quite
well as long as I return only *subsets* of the *total* result-set at any one
time. E.g., I may have 25,000 hits ("Accidents") that meet my threshold Lucene
score, but as long as I only query the database for 50 Accident "objects" at
any one time, the response time is great.
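For context, the batched-retrieval pattern described above looks roughly like the following sketch (the helper names and the stand-in fetch function are invented for illustration, not my actual code):

```python
def fetch_in_batches(accident_ids, fetch_batch, batch_size=50):
    """Fetch Accident objects in fixed-size batches, so each
    database round-trip stays small and response time stays good."""
    for start in range(0, len(accident_ids), batch_size):
        batch_ids = accident_ids[start:start + batch_size]
        yield from fetch_batch(batch_ids)

# Stand-in for a real database query: returns one dict per id.
def fake_fetch(ids):
    return [{"id": i} for i in ids]

# 120 hits fetched 50 at a time -> three round-trips.
rows = list(fetch_in_batches(list(range(120)), fake_fetch, batch_size=50))
```

The slow part is that the full report needs *all* 25,000 hits, so this batching alone doesn't help there.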

The 'problem' is that I'd also like to use those 25,000 Accidents to
generate an electronic report as **quickly as possible**
(right now it takes about 30 minutes to collect all 25,000 hits from the
database, extract the relevant fields and construct the actual report).
Most of this 30 minutes is spent hitting the database and
processing/extracting the relevant data (generating the report is rather
fast once all the data are properly formatted).

So...at my naive level, this seems like a decent job for hadoop.
***QUESTION 1: Is this an accurate belief?***

i.e., I have a semi-large collection of key/value pairs (25,000 Accident IDs
would be the keys, and 25,000 Accident objects would be values)

These key/value pairs are "mapped" on a cluster, extracting the relevant
data from each object.
The mapping then emits a new set of key/value pairs (in this case, the
emitted keys are one of three categories (accident, report, outcome) and the
values are arrays of accident, report, and outcome data that will go into the
report).

These emitted key/value pairs are then "reduced", and the resulting reduced
collections are used to build the report.
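To make the data flow I have in mind concrete, here is a minimal, Hadoop-free sketch in Python (all record and field names are invented): a map step that turns each Accident into category-keyed rows, and a reduce step that groups those rows by category for the report builder. On a real cluster this would be a Hadoop Mapper/Reducer (or Hadoop Streaming scripts), but the shape of the computation is the same.

```python
from collections import defaultdict

# Hypothetical, denormalized input: each record carries an accident plus
# its nested reports and outcomes (field names are made up for the sketch).
accidents = {
    "A1": {"severity": "major",
           "reports": [{"id": "R1", "outcomes": [{"id": "O1"}, {"id": "O2"}]}]},
    "A2": {"severity": "minor",
           "reports": [{"id": "R2", "outcomes": [{"id": "O3"}]}]},
}

def map_accident(accident_id, accident):
    """Emit (category, row) pairs -- one row per entity the report needs."""
    yield ("accident", (accident_id, accident["severity"]))
    for report in accident["reports"]:
        yield ("report", (accident_id, report["id"]))
        for outcome in report["outcomes"]:
            yield ("outcome", (report["id"], outcome["id"]))

def reduce_by_category(pairs):
    """Group the emitted rows by category, as a reducer would see them."""
    grouped = defaultdict(list)
    for category, row in pairs:
        grouped[category].append(row)
    return grouped

pairs = [p for aid, acc in accidents.items() for p in map_accident(aid, acc)]
report_input = reduce_by_category(pairs)
# report_input["accident"] -> [("A1", "major"), ("A2", "minor")]
```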

***QUESTION 2:  If the answer to Q1 is "yes", how does one typically "move"
data from a rdbms to something like HDFS/HBase?***
***QUESTION 3:  Am I right in thinking that my HBase data are going to be
denormalized relative to my RDBMS?***
***QUESTION 4:  How are the data within an HBase database *actually*
distributed amongst nodes?  i.e., is the distribution done automatically upon
creating the db (assuming the cluster already exists)?  Or do you have to
issue some type of command that says "okay...here's the HBase db, distribute
it to nodes a - z"***
***QUESTION 5:  Or is this whole problem something better addressed by some
type of high-performance rdbms cluster?***
***QUESTION 6:  Is there a detailed (step by step) tutorial on how to use
HBase w/ Hadoop?***

Anyway, apologies if this is the 1000th time this has been answered and
thank you for any insight!
