hbase-user mailing list archives

From Tatsuya Kawano <tatsuya6...@gmail.com>
Subject Re: Adding a tiny HBase cluster to existing Hadoop environment
Date Fri, 04 Jun 2010 22:23:54 GMT

Hi Todd, 

Thanks for answering my question. 

> On Thu, Jun 3, 2010 at 5:06 PM, Tatsuya Kawano wrote:
>> I remember Jon mentioning the other day that he was trying a single HBase
>> server with an existing HDFS cluster to serve map reduce (MR) results. I wonder
>> how that went.


>> So I'm thinking of recommending they add just one server (non-HA) or two
>> servers (HA) to their Hadoop cluster, and run only the HMaster and Region Server
>> processes on the server(s). The HBase cluster will utilize the existing
>> (small or large) HDFS cluster and ZooKeeper ensemble.

I went back to the mailing list archive and found that the information I needed was already
there; Jon had written up the pros and cons of a similar configuration. 

RE: HBase on 1 box? how big?
http://markmail.org/thread/3yfoou4gna2fex5f#query:+page:1+mid:4m27ay3mwuh2a5vu+state:results


On 06/04/2010, at 9:37 AM, Todd Lipcon wrote:
> If your "exported dataset" from the MR job is small enough to fit on one
> server, you can certainly use a single HBase RS plus the bulk load
> functionality. However, with a small dataset like that it might make more
> sense to simply export TSV/CSV and then use a tool like Sqoop to export to a
> relational database. That way you'd have better off the shelf integration
> with various other tools or access methods.

Thanks for the suggestion. In this particular configuration, I'm expecting one RS to handle
a far larger dataset than a typical HBase configuration would. The dataset is read-only, so all
memstores will stay empty. That leaves more room in RAM, and the RS could serve more regions
than usual. Also, the RS is backed by the existing HDFS installation. The larger cluster has
more than 50 Data Nodes, which could give the RS better concurrent random-read capacity than a
single-node RDB with local hard drives.
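
If we go this way, I'm also thinking of shifting most of the region server heap from the
memstores to the block cache in hbase-site.xml, roughly like below (the property names are
from hbase-default.xml; the ratios are just my untested guess for this workload):

  <!-- read-only serving: shrink the memstore quota, grow the block cache -->
  <property>
    <name>hfile.block.cache.size</name>
    <value>0.5</value>   <!-- fraction of RS heap for the HFile block cache -->
  </property>
  <property>
    <name>hbase.regionserver.global.memstore.upperLimit</name>
    <value>0.15</value>  <!-- memstores stay nearly empty on a read-only table -->
  </property>
  <property>
    <name>hbase.regionserver.global.memstore.lowerLimit</name>
    <value>0.10</value>
  </property>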

I talked to the guys last night, and one of them is also evaluating RDBs (Sybase, Oracle
and MySQL). His current concern is that loading the large dataset into an RDB is time
consuming. He's going to try the native import utilities for those RDBs, and Sqoop is on his
list too. (He attended the Cloudera Hadoop training in Tokyo.) But he also wants to try HBase
as another option because it has better MR integration. 
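
For the HBase option, the plan would be the HBASE-1923 bulk load path: have the MR job write
HFiles with HFileOutputFormat (configureIncrementalLoad), then load them into the existing
table, roughly like this (the path and table name below are just placeholders):

  # HFiles written by the MR job under /user/foo/hfiles
  bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
      /user/foo/hfiles mr_results

That skips the normal write path entirely, which is part of why HBase looks attractive given
his concern about load times.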


>> Also, I saw Jon's slides for Hadoop World in NYC 2009, which said I'd better
>> have at least 5 Region Servers / Data Nodes in my cluster to get typical
>> performance. If I deploy RS and DN on separate servers, which one should be
>> >= 5 nodes? DN? RS? or both?
>> 
>> 
> Better to colocate the DNs and RSs for most deployments. You get
> significantly better random read performance for uncached data.

If I could build the cluster from scratch, I would suggest that. The difficult part of my
case is that the current installation (50+ servers) wasn't sized with RSs in mind. I'd need to
add more processor cores and RAM to the current servers to make them reliable Task Tracker + DN
+ RS nodes. Also, I clearly don't need all 50+ servers running an RS, so maybe five of them?
But having only five region servers on 50+ data nodes would leave the HDFS data blocks unevenly
distributed across the cluster, since each RS writes its first replica to its local DN. That
won't be an optimal solution. 

So, in this particular case, I'd rather separate the RSs from the DNs to keep the data blocks
evenly distributed. I'm not sure this would hurt random reads much, because network latency on
today's hardware (about 0.1 ms on average) is small compared to a seek on server-class
15,000 RPM hard drives (about 5 ms), so the extra hop adds only a few percent per uncached
read. The only drawback I can think of is network congestion when doing massive writes and
scans, but my case doesn't involve such operations. 


It was good to learn that having fewer than five region servers is not a bad idea (as long as
you have enough HDFS data nodes). Your and Jon's emails gave me a good sense of what to avoid,
and one of my friends is evaluating RDBs as well. 

Thanks, 
Tatsuya




On 06/04/2010, at 9:37 AM, Todd Lipcon wrote:

> Hi Tatsuya,
> 
> On Thu, Jun 3, 2010 at 5:06 PM, Tatsuya Kawano <tatsuya6502@gmail.com>wrote:
> 
>> Hello,
>> 
>> I remember Jon mentioning the other day that he was trying a single HBase
>> server with an existing HDFS cluster to serve map reduce (MR) results. I wonder
>> how that went.
>> 
>> A couple of friends in Tokyo are considering HBase to do a similar thing.
>> They want to serve MR results inside the clients' companies via HBase. They
>> both have existing MR/HDFS environments; one has a small (< 10 nodes) cluster
>> and the other a large (> 50 nodes) one.
>> 
>> They'll use the incremental load into an existing table (HBASE-1923) to add
>> the MR results to the HBase table, and only a few users will read and export
>> (web CSV download) the results via HBase. So HBase will be lightly loaded.
>> They probably won't even need the high availability (HA) option for HBase.
>> 
>> So I'm thinking of recommending they add just one server (non-HA) or two
>> servers (HA) to their Hadoop cluster, and run only the HMaster and Region Server
>> processes on the server(s). The HBase cluster will utilize the existing
>> (small or large) HDFS cluster and ZooKeeper ensemble.
>> 
>> 
> If your "exported dataset" from the MR job is small enough to fit on one
> server, you can certainly use a single HBase RS plus the bulk load
> functionality. However, with a small dataset like that it might make more
> sense to simply export TSV/CSV and then use a tool like Sqoop to export to a
> relational database. That way you'd have better off the shelf integration
> with various other tools or access methods.
> 
> 
>> The server spec will be 2 x 8-core processors and 8GB to 24GB RAM. The RAM
>> size will change depending on the data volume and access pattern.
>> 
>> Has anybody tried a similar configuration, and how did it go?
>> 
>> 
>> Also, I saw Jon's slides for Hadoop World in NYC 2009, which said I'd better
>> have at least 5 Region Servers / Data Nodes in my cluster to get typical
>> performance. If I deploy RS and DN on separate servers, which one should be
>> >= 5 nodes? DN? RS? or both?
>> 
>> 
> Better to colocate the DNs and RSs for most deployments. You get
> significantly better random read performance for uncached data.
> 
> -Todd
> 
> 
>> 
>> Thanks,
>> Tatsuya Kawano
>> Tokyo, Japan
>> 
>> 
>> 
>> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera

