Subject: Re: Maximizing throughput
From: Bryan Keller <bryanck@gmail.com>
Date: Fri, 11 Jan 2013 09:37:17 -0800
To: user@hbase.apache.org

Thanks for the responses. I'm running HBase 0.92.1 (Cloudera CDH4).

The program is very simple: it inserts batches of rows into a table via multiple threads. I've tried running it with different parameters (column count, threads, batch size, etc.), but throughput didn't improve. I've pasted the code here: http://pastebin.com/gPXfdkPy (a simplified sketch is below as well).

I have auto-flush on (the default) since I'm inserting rows in batches, so I don't need the internal HTable write buffer.

I've posted my config as well: http://pastebin.com/LVG9h6Z4

The regionservers have 12 cores (24 with HT), 128 GB RAM, and 6 SCSI drives. Max throughput is 90-100 MB/sec per drive.
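For anyone who doesn't want to follow the pastebin link, the insert loop is essentially the following. This is only a simplified sketch -- the table name, column family, and counts here are placeholders; the real values are in the pasted code.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchInsertSketch {
    // Placeholder values -- the real table/family/sizes are in the pastebin.
    private static final byte[] FAMILY = Bytes.toBytes("f");
    private static final int THREADS = 16;
    private static final int BATCH_SIZE = 1000;
    private static final int BATCHES_PER_THREAD = 100;

    public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int t = 0; t < THREADS; t++) {
            final int threadId = t;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // One HTable per thread, since HTable is not thread-safe.
                        HTable table = new HTable(conf, "testtable");
                        for (int b = 0; b < BATCHES_PER_THREAD; b++) {
                            List<Put> puts = new ArrayList<Put>(BATCH_SIZE);
                            for (int i = 0; i < BATCH_SIZE; i++) {
                                byte[] row = Bytes.toBytes(threadId + "-" + b + "-" + i);
                                Put put = new Put(row);
                                put.add(FAMILY, Bytes.toBytes("q"), Bytes.toBytes("value"));
                                puts.add(put);
                            }
                            // Auto-flush is on (the default), so each put(List)
                            // call sends the whole batch rather than buffering it.
                            table.put(puts);
                        }
                        table.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
    }
}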
I've also tested this on an EC2 High I/O instance type with 2 SSDs, 64 GB RAM, and 8 cores (16 with HT). Both the EC2 cluster and my colo cluster have the same issue of seemingly underutilizing resources.

I measure disk usage with iostat and measured the theoretical max with hdparm and dd. I use iftop to monitor network bandwidth and used iperf to test the theoretical max. For CPU usage I use top and iostat.

The maximum write throughput I'm getting is usually around 20 MB/sec per drive (this is on my colo cluster) on each of the 2 data nodes. That's about 20% of the max, and it is only sporadic, not a consistent 20 MB/sec per drive. Network usage also seems to top out around 20% (200 Mbit/sec) to each node. CPU usage on each node is around 10%. The problem is more pronounced on EC2, which has much higher theoretical limits for storage and network I/O.

Copying a 133 GB file to HDFS shows similar behavior to HBase (sporadic disk usage topping out at 20%, low CPU, 30-40% network I/O), so it seems this is more of an HDFS issue than an HBase issue.
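For what it's worth, the same HDFS-only test can be approximated without a big source file by writing directly through the FileSystem API, roughly like this (just a sketch; the path and size below are arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        byte[] buf = new byte[1024 * 1024];  // 1 MB of zeros per write
        long totalMb = 10 * 1024;            // ~10 GB; increase for a longer run
        long start = System.currentTimeMillis();
        FSDataOutputStream out = fs.create(new Path("/tmp/hdfs-write-test"));
        for (long i = 0; i < totalMb; i++) {
            out.write(buf);
        }
        out.close();
        long secs = Math.max((System.currentTimeMillis() - start) / 1000, 1);
        System.out.println(totalMb + " MB in " + secs + "s = " + (totalMb / secs) + " MB/s");
    }
}

Watching iostat and iftop on the data nodes while that runs shows the same sporadic, low utilization I described above.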