Subject: Re: Adding a tiny HBase cluster to existing Hadoop environment
From: Tatsuya Kawano
To: user@hbase.apache.org
Date: Sat, 5 Jun 2010 07:23:54 +0900

Hi Todd,

Thanks for answering my question.

> On Thu, Jun 3, 2010 at 5:06 PM, Tatsuya Kawano wrote:
>> I remember Jon was talking the other day about trying a single HBase
>> server with an existing HDFS cluster to serve map reduce (MR) results. I
>> wonder if this went well or not.
>> So I'm thinking to recommend them to add just one server (non-HA) or two
>> servers (HA) to their Hadoop cluster, and run only HMaster and Region
>> Server processes on the server(s). The HBase cluster will utilize the
>> existing (small or large) HDFS cluster and ZooKeeper ensemble.

I went back to the mailing list archive and found that the information I needed was already there; Jon wrote down the pros and cons of a similar configuration.

RE: HBase on 1 box? how big?
http://markmail.org/thread/3yfoou4gna2fex5f#query:+page:1+mid:4m27ay3mwuh2a5vu+state:results

On 06/04/2010, at 9:37 AM, Todd Lipcon wrote:
> If your "exported dataset" from the MR job is small enough to fit on one
> server, you can certainly use a single HBase RS plus the bulk load
> functionality. However, with a small dataset like that it might make more
> sense to simply export TSV/CSV and then use a tool like Sqoop to export
> to a relational database. That way you'd have better off the shelf
> integration with various other tools or access methods.

Thanks for the suggestion. In this particular configuration, I'm expecting one RS to handle a far larger dataset than in a typical HBase configuration. The dataset is read-only, so all memstores will be empty. This leaves more room in RAM, and the RS could take on more regions than usual. Also, the RS is backed by the current HDFS installation. The larger cluster has more than 50 Data Nodes, which could give the RS better concurrent random read capacity than a single-node RDB with local hard drives.

I talked to the guys last night, and one of them is also evaluating RDBs (Sybase, Oracle and MySQL). His current concern is that loading the large dataset into an RDB is time consuming. He's going to try the native import utilities for the RDBs, and Sqoop is on his list too. (He attended Cloudera Hadoop training in Tokyo.) But he also wants to try HBase as another option because it has better MR integration.

>> Also, I saw Jon's slides for Hadoop World in NYC 2009, and it was said
>> that I'd better have at least 5 Region Servers / Data Nodes in my
>> cluster to get the typical performance. If I deploy RS and DN on
>> separate servers, which one should be >= 5 nodes? DN? RS? or both?
>
> Better to colocate the DNs and RSs for most deployments. You get
> significantly better random read performance for uncached data.
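To put rough numbers on the empty-memstore point above: this is only a back-of-envelope sketch. The 24 GB heap is the upper end of the spec mentioned in this thread, and the 0.40 memstore / 0.20 block cache fractions are the common defaults, not measured values from any of these clusters.

```python
# Back-of-envelope: RAM headroom on a read-only region server.
# Assumptions: 24 GB heap; default-ish heap fractions of 0.40 for
# memstores (writes) and 0.20 for the block cache (reads).

heap_gb = 24.0
memstore_fraction = 0.40    # heap normally reserved for memstores
blockcache_fraction = 0.20  # heap normally used as block cache

# Normal read/write RS: only the block cache share serves random reads.
normal_cache_gb = heap_gb * blockcache_fraction

# Read-only table: memstores stay empty, so that share could be retuned
# into extra block cache, roughly tripling the cached data per server.
readonly_cache_gb = heap_gb * (blockcache_fraction + memstore_fraction)

print(normal_cache_gb, readonly_cache_gb)
```

So even before counting the 50+ data nodes behind it, a read-only RS tuned this way could keep roughly three times as much data hot in cache as a default read/write one.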
If I could build the cluster from scratch, I would do so. The difficult part in my case is that the current installations (50+ servers) were not intended to host RSs. I would need to add more processor cores and RAM to the current servers to make reliable Task Tracker + DN + RS nodes. Also, it's obvious I don't need all 50+ servers to run an RS, so maybe five of them? But having only five region servers on 50+ data nodes results in the HDFS data blocks being unevenly distributed across the cluster. This won't be an optimal solution.

So, in this particular case, I'd rather separate the RSs from the DNs to keep the data blocks evenly distributed. I'm not sure this will hurt random read performance, because the network latency of today's hardware (average 0.1 ms) is good enough compared to server-class 15,000 RPM hard drives (5 ms). The only drawback I can think of is network congestion when doing massive writes and scans, but my case doesn't involve such operations.

It was good to know that having fewer than five region servers is not a bad idea (as long as you have enough HDFS data nodes). Your and Jon's emails gave me some information about things to avoid, and one of my friends is evaluating RDBs as well.

Thanks,
Tatsuya

On 06/04/2010, at 9:37 AM, Todd Lipcon wrote:
> Hi Tatsuya,
>
> On Thu, Jun 3, 2010 at 5:06 PM, Tatsuya Kawano wrote:
>
>> Hello,
>>
>> I remember Jon was talking the other day about trying a single HBase
>> server with an existing HDFS cluster to serve map reduce (MR) results.
>> I wonder if this went well or not.
>>
>> A couple of friends in Tokyo are considering HBase to do a similar
>> thing. They want to serve MR results inside the clients' companies via
>> HBase. They both have existing MR/HDFS environments; one has a small
>> (< 10) and another has a large (> 50) cluster.
>>
>> They'll use the incremental loading to an existing table (HBASE-1923)
>> to add the MR results to the HBase table, and only a few users will
>> read and export (web CSV download) the results via HBase. So HBase will
>> be lightly loaded. They probably won't even need the high availability
>> (HA) option on HBase.
>>
>> So I'm thinking to recommend them to add just one server (non-HA) or
>> two servers (HA) to their Hadoop cluster, and run only HMaster and
>> Region Server processes on the server(s). The HBase cluster will
>> utilize the existing (small or large) HDFS cluster and ZooKeeper
>> ensemble.
>>
> If your "exported dataset" from the MR job is small enough to fit on one
> server, you can certainly use a single HBase RS plus the bulk load
> functionality. However, with a small dataset like that it might make more
> sense to simply export TSV/CSV and then use a tool like Sqoop to export
> to a relational database. That way you'd have better off the shelf
> integration with various other tools or access methods.
>
>> The server spec will be 2 x 8-core processors and 8GB to 24GB RAM. The
>> RAM size will change depending on the data volume and access pattern.
>>
>> Has anybody tried a similar configuration? And how did it go?
>>
>> Also, I saw Jon's slides for Hadoop World in NYC 2009, and it was said
>> that I'd better have at least 5 Region Servers / Data Nodes in my
>> cluster to get the typical performance. If I deploy RS and DN on
>> separate servers, which one should be >= 5 nodes? DN? RS? or both?
>>
> Better to colocate the DNs and RSs for most deployments. You get
> significantly better random read performance for uncached data.
>
> -Todd
>
>> Thanks,
>> Tatsuya Kawano
>> Tokyo, Japan
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
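P.S. For anyone finding this in the archive: the "one or two HBase-only servers on an existing HDFS cluster and ZooKeeper ensemble" setup discussed above mostly comes down to pointing hbase-site.xml on those servers at the existing services. A minimal sketch; the hostnames and ports are placeholders, not anyone's real cluster:

```xml
<!-- hbase-site.xml on the HBase-only server(s).
     namenode.example.com and zk1..zk3.example.com are made-up names. -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.com:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
```

You'd also want HBASE_MANAGES_ZK=false in hbase-env.sh so HBase uses the existing ensemble instead of starting its own.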