hbase-user mailing list archives

From Liu Yan <gzbigegg...@gmail.com>
Subject HBase Performance Tuning
Date Mon, 06 Apr 2009 13:53:39 GMT
I have a 4-node cluster, with the following configuration:

1) master: 7.5G memory, dual-core CPU, running Hadoop NN/DN/TT/JT, HBase
Master and HBase Region Server
2) 2 slaves: 1.7G memory, single-core CPU, running Hadoop DN/TT, and HBase
Region Server

The DNs on both slaves are at about 66% usage, while the DN on the master is
at about 36%.

mapred.tasktracker.map.tasks.maximum: 12 (master), 4 (slaves)
mapred.tasktracker.reduce.tasks.maximum: 12 (master), 4 (slaves)
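
As a rough sanity check on the slot settings above, the totals and the memory available per slot can be worked out as below. This is only a sketch: the node names are made up, it uses the two slaves listed above, and it ignores the memory used by the DN, TT and Region Server daemons, so the per-slot figures are upper bounds.

```python
# Values copied from the configuration above; node names are hypothetical.
nodes = [
    {"name": "master", "memory_gb": 7.5, "map_slots": 12, "reduce_slots": 12},
    {"name": "slave-1", "memory_gb": 1.7, "map_slots": 4, "reduce_slots": 4},
    {"name": "slave-2", "memory_gb": 1.7, "map_slots": 4, "reduce_slots": 4},
]

# Cluster-wide number of tasks that can run at the same time.
total_map_slots = sum(n["map_slots"] for n in nodes)
total_reduce_slots = sum(n["reduce_slots"] for n in nodes)


def memory_per_slot_mb(node):
    """Physical memory divided by all task slots on the node.

    Upper bound only: daemons running on the node are not subtracted.
    """
    slots = node["map_slots"] + node["reduce_slots"]
    return node["memory_gb"] * 1024 / slots


for node in nodes:
    print(node["name"], round(memory_per_slot_mb(node), 1), "MB per slot")
print("map slots:", total_map_slots, "reduce slots:", total_reduce_slots)
```

If every slot is busy at once, each task on a slave has at most a couple of hundred MB of physical memory, before the daemons are counted.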

The job is this: I recursively read a few hundred CSV files from a specified
directory on HDFS. The first line of each file is a "column list" for that
particular file. The map task parses each file line by line, and the reduce
task writes the parsed result into HBase. The total size of the files is
about 2.6GB.

CSV ==> <NamedRowOffset, Text> == (map) ==>
<ImmutableBytesWritable, HbaseMapWritable<byte[], byte[]>> == (reduce) ==>
<ImmutableBytesWritable, BatchUpdate>

Note: NamedRowOffset is a custom class so we can know the current file name,
column names, etc.
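
To make the parsing step concrete, here is a minimal sketch of the per-file logic in plain Python, outside Hadoop. It is not the actual job: the function name and the row key format (file name plus line number) are assumptions for illustration only.

```python
import csv
import io


def parse_csv_with_header(file_name, text):
    """Yield (row_key, {column: value}) pairs for one CSV file.

    The first line is treated as the column list for this file.
    The row key format "<file_name>:<line_number>" is an assumption.
    """
    reader = csv.reader(io.StringIO(text))
    try:
        columns = next(reader)  # first line: column list for this file
    except StopIteration:
        return  # empty file: nothing to emit
    for line_number, fields in enumerate(reader, start=2):
        if not fields:
            continue  # skip blank lines
        row_key = f"{file_name}:{line_number}"
        yield row_key, dict(zip(columns, fields))


sample = "id,name\n1,foo\n2,bar\n"
rows = list(parse_csv_with_header("a.csv", sample))
for row_key, cells in rows:
    print(row_key, cells)
```

Each yielded pair corresponds to one map output record (a row key plus a column-to-value map), which the reduce side would then turn into a single update for that row.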

I tried different numbers of map and reduce tasks, and the total throughput
varies between runs. I am trying to answer:

1) What are the best numbers for map and reduce tasks in my particular
2) Besides the number of map and reduce tasks, do any other parameter(s)
3) What is the common approach for observing and fine-tuning the parameters
(considering both Hadoop and HBase)?

