hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim Robertson <timrobertson...@gmail.com>
Subject Re: Advice sought for mixed hardware installation
Date Thu, 14 Oct 2010 17:51:26 GMT
>That's a lot of information to digest Tim, so bear with me if I miss
> on some details :)

(ahem) Thanks for taking the time J-D - a lot of information but I
figured it best to just lay it all out there, especially if it helps
others.

I had it in my mind that HBase liked big memory, hence assuming the
region servers should stay on the 24G machines with plenty of memory
at their  disposal.  We'll come up with a test platform and then try
some benchmarking and do a blog on it all and share.

Cheers,
Tim




On Thu, Oct 14, 2010 at 6:57 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> That's a lot of information to digest Tim, so bear with me if I miss
> on some details :)
>
> a) isn't really good, the big nodes have a lot of computational power
> AND spindles so leaving them like that is a waste, is there's 0
> locality (MR to HBase, HBase to HDFS)
>
> b) sounds weird, would need more time to think about it
>
> and let me propose
>
> c) 10 nodes with HDFS and HBase, the big nodes with HDFS and MR.
>
>  - My main concern in this setup is giving HBase some processing power
> and lots of RAM. In this case you can give 6GB to the RSs, 1GB to the
> DN, and 1GB for the OS (caching, etc).
>  - On the 3 nodes, setup MR so that it uses as many tasks as those
> machines can support (do they have hyper-threading? if so you can even
> use more than 8 tasks). At the same time, the tasks can enjoy a full
> 1GB of heap each.
>  - On locality, HBase will be collocated with DNs so this is great in
> many ways, better than collocating HBase with MR since it's not always
> useful (like on a batch import job, the tasks may use different
> regions at the same time and you cannot predict that... so they still
> go on the network).
>  - On other thing on locality, MR tasks do write intermediate data on
> HDFS so having them collocated with DNs will help.
>
> Regarding the master/NN/ZK, since it's a very small cluster I would
> use one of the small node to collocate the 3 of them (this means you
> will only have 9 RS). You don't really need an ensemble, unless you're
> planning to share that ZK setup with other apps.
>
> In any case, you should test all setups.
>
> J-D
>
> On Thu, Oct 14, 2010 at 4:51 AM, Tim Robertson
> <timrobertson100@gmail.com> wrote:
>> Hi all,
>>
>> We are about to setup a new installation using the following machines,
>> and CDH3 beta 3:
>>
>> - 10 nodes of single quad core, 8GB memory, 2x500GB SATA
>> - 3 nodes of dual quad core, 24GB memory, 6x250GB SATA
>>
>> We are finding our feet, and will blog tests, metrics etc as we go but
>> our initial usage patterns will be:
>>
>> - initial load of 250 million records to HBase
>> - data harvesters pushing 300-600 records per second of insert or
>> update (under 1KB per record) to TABLE_1 in HBase
>> - MR job processing changed content in TABLE_1 into TABLE_2 on an
>> (e.g.) 6 hourly cron job (potentially using co-processors in the
>> future)
>> - MR job processing changed content in TABLE_2 into TABLE_3 on an
>> (e.g.) 6 hourly cron job (potentially using co-processors in the
>> future)
>> - MR jobs building Lucene, SOLR, PostGIS (hive+sqoop) indexes  on a
>> 6,12 or 24 hourly cron job either by
>>  a) bulk export from HBase to .txt and then Hive or custom MR processing
>>  b) hive or custom MR processing straight from HBase tables as the input format
>> - MR jobs building analytical counts (e.g. 4 way "group bys" in SQL
>> using Hive) on 6,12,4 hourly cron either by
>>  a) bulk export from HBase to .txt and then Hive / custom MR processing
>>  b) hive, MR processing straight from HBase tables
>>
>> To give an idea, at the moment on the 10 node cluster Hive against
>> .txt files does full scan in 3-4 minutes (our live system is Mysql and
>> we export to .txt for Hive)
>>
>> I see we have 2 options, but I am inexperienced and seek any guidance:
>>
>> a) run HDFS across all 13 nodes, MR on the 10 small nodes, region
>> servers on the 3 big nodes
>>  - MR will never benefit from data locality when using HBase (? I think)
>> b) run 2 completely separate clusters
>>  clu1: 10 nodes, HDFS, MR
>>  clu2: 3 nodes, HDFS, MR, RegionServer
>>
>> With option b) we would do 6 hourly exports from clu2 -> clu1 and
>> really keep the processing load off the HBase cluster
>>
>> We are prepared to run both, benchmark and provide metrics, but I
>> wonder if someone has some advice beforehand.
>>
>> We are anticipating:
>> - NN, 2nd NN, JT on 3 of the 10 smaller nodes
>> - HBase master on 1 of the 3 big nodes
>> - 1 ZK daemon on 1 of the 3 big nodes (or should we go for an assemble
>> of 3, with one on each)
>>
>> Thanks for any help anyone can provide,
>>
>> Tim
>> (- and Lars F.)
>>
>

Mime
View raw message