hadoop-common-dev mailing list archives

From Andrew Purtell <apurt...@yahoo.com>
Subject RE: [hbase] table HRegionServer affinity
Date Tue, 08 Jan 2008 19:04:39 GMT
Thank you all for your considered responses. I would like
to make a few comments and clarifications (inline, below). 

Jim Kellerman wrote:
> > Andrew Purtell wrote:
> > First: How invasive to the HBase master/region model is
> > the concept of specifying constraints on data mobility?
> It would be very disruptive. The current model is that
> you run one or more HBase clusters per HDFS cluster. An
> HBase cluster does not span HDFS clusters.
> As far as I know HDFS clusters do not span data centers.
> Latency and network partitioning would be big problems
> for a system that requires sub-second response times.

I was not suggesting that HDFS span data centers, only that
one HBase instance span several HDFS clusters. The
configuration, for example, might look like:

   master in US
   region server and local HDFS cluster in US
   region server and local HDFS cluster in EU
   region server and local HDFS cluster in APAC

where each region server is backed by a local HDFS cluster
on a gigabit backplane, and in each geographic region
globally distributed map-reduce jobs execute with
data-driven regional differences. Yet, at the same time,
jobs in any given region can query rows generated in
another via globally distributed/available table(s).
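
To make the layout above concrete, here is a minimal Python
sketch of the topology together with one possible form of a
data mobility constraint. The site names come from the list
above; the HDFS cluster names, the table name, and the
may_host helper are hypothetical and are not part of HBase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str                  # geographic area, e.g. "US"
    hdfs_cluster: str          # hypothetical name of the local HDFS cluster
    hosts_master: bool = False


# One master in the US, one region server plus local HDFS cluster per site.
SITES = {
    "US": Site("US", "hdfs-us", hosts_master=True),
    "EU": Site("EU", "hdfs-eu"),
    "APAC": Site("APAC", "hdfs-apac"),
}

# Hypothetical mobility constraints: table name -> sites allowed to host it.
# A table with no entry is unconstrained and may be hosted anywhere.
PINNED = {
    "eu_events": {"EU"},
}


def may_host(table: str, site: str) -> bool:
    """Return True if a region of `table` may be assigned to `site`."""
    if site not in SITES:
        raise KeyError(f"unknown site: {site}")
    allowed = PINNED.get(table)
    return allowed is None or site in allowed
```

Under these assumptions, may_host("eu_events", "APAC") is
false, while a table without an entry in PINNED may be
hosted at any site.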

I have set up this configuration in the lab using 0.15.1
(compiled by hadoopqa from revision 596497), even with
artificial latency introduced to simulate international
links, and I can say that it works for me. It may only work
by accident. Also, my testing thus far has been rather
limited: e.g. create table on one cluster, then insert on
another, then select from a third, etc.

Fault tolerance considerations due to an elevated risk of
network partition are of course an issue. Allowing modified
region servers to continue serving explicitly partitioned
tables in the extended absence of communication with the
master might be a first-cut option, but I suspect you'd
take a dim view of this: perhaps "pollution" of a clean
model with hacking.
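
As a rough illustration of that first-cut option, the
decision a modified region server would make could be
sketched as below. This is only a sketch under stated
assumptions: the function name, the notion of a "pinned"
table, and the 30 second lease are placeholders, not
existing HBase behavior or settings.

```python
def should_keep_serving(table_is_pinned: bool,
                        seconds_since_master_contact: float,
                        lease_seconds: float = 30.0) -> bool:
    """Decide whether a region server keeps serving a table's regions.

    `lease_seconds` is an arbitrary placeholder value.
    """
    if seconds_since_master_contact <= lease_seconds:
        # Master contact is recent: normal operation for every table.
        return True
    # Lease expired. In this sketch only explicitly partitioned (pinned)
    # tables stay online, on the assumption that their regions are never
    # reassigned to another site while the master is unreachable.
    return table_is_pinned
```

A pinned table would stay online through a long partition,
while any other table would go offline once the lease
expires.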

Sub-second response times should not be a problem because
in addition to constraints on data mobility we'd use query
extensions to limit query scope to the region(s) where the
data is known to reside for the bulk of map-reduce
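
The idea of limiting query scope to the places where the
data is known to reside could look roughly like the
following Python sketch. The catalog, the table names, and
the sites_to_query function are hypothetical; no such query
extension exists in HBase 0.15.1 as far as this message
states.

```python
# Hypothetical catalog: table name -> sites known to hold its rows.
DATA_LOCATIONS = {
    "eu_events": {"EU"},
    "global_index": {"US", "EU", "APAC"},
}


def sites_to_query(table, scope=None):
    """Return the sorted list of sites a query must contact.

    `scope` is an optional collection of sites supplied by the caller.
    The query is limited to the sites in scope that actually hold data
    for `table`, so no request crosses a long-haul link needlessly.
    """
    holders = DATA_LOCATIONS.get(table, set())
    if scope is None:
        return sorted(holders)
    return sorted(holders & set(scope))
```

A job running in the EU against eu_events would then
contact only the EU site, and a scope that excludes every
site holding the data yields an empty list.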

> A change such as this would require major changes to the
> architecture and our vision of the model going forward.
> (replication between data centers and a single table
> residing in multiple data centers being served by
> separate HBase instances running on separate HDFS
> clusters).

And I thank you for this, and also for the -1 from Edward,
as it is instructive as to how far our ideas for using
HBase might diverge from those of the community, at least
with respect to what amounts to cluster federation. Anyway,
at this time, we are only considering these things.

Best regards,

Andrew Purtell
Advanced Threats Research
Trend Micro, Inc, Pasadena, CA USA
(personal mail)
