hbase-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Optimizing bulk load performance
Date Thu, 24 Oct 2013 14:14:32 GMT
Hi Harry,

Do you have more details on the exact load? Can you run vmstat and see
what kind of load it is? Is it user CPU, system CPU, or I/O wait (wio)?
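
If vmstat is not handy, the same breakdown can be approximated from two
samples of the aggregate "cpu" line in /proc/stat on a Linux node. This is
only a rough sketch: the function names are mine, and the field order
(user, nice, system, idle, iowait, irq, softirq, steal) is the standard
Linux layout, which I am assuming applies to your kernel.

```python
# Field order of the aggregate "cpu" line in /proc/stat on Linux.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_breakdown(before, after):
    """Return the percentage of CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu_sample(path="/proc/stat"):
    """Read one sample of the aggregate CPU counters (Linux only)."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())
```

Call read_cpu_sample() twice, a few seconds apart while the reducers are
running, and pass both results to cpu_breakdown(). A large "iowait" share
next to a small "user" share would point at the disks rather than the CPU.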

I suspect your disks are the issue. There are 2 things here.

First, we don't recommend RAID for the HDFS/HBase disks. The best is to
simply mount the 2 disks on 2 separate mount points and give both to HDFS.
Second, 2 disks per node is very low. Even on a dev cluster it is not
recommended. In production, you should go with 12 or more.

So with only 2 disks in RAID, I suspect your WIO is high, which might be
what slows your process.
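
As a quick sanity check on the numbers from your message (about 0.5TB in
10h on 12 hosts), here is the back-of-the-envelope arithmetic. It assumes
decimal units and counts only the final output, not the shuffle or sort
spill traffic, so the real disk traffic is higher than this.

```python
def throughput_mb_per_s(total_bytes, hours):
    """Average throughput in MB/s (decimal megabytes) over the whole job."""
    return total_bytes / (hours * 3600) / 1e6


TOTAL_BYTES = 0.5e12  # ~0.5TB of loaded data, taken as decimal terabytes
HOURS = 10            # reported job duration
NODES = 12            # reported number of hosts

cluster_rate = throughput_mb_per_s(TOTAL_BYTES, HOURS)
per_node_rate = cluster_rate / NODES

print(f"cluster:  {cluster_rate:.1f} MB/s")
print(f"per node: {per_node_rate:.2f} MB/s")
```

That is roughly 14 MB/s for the whole cluster and a bit over 1 MB/s per
node, which is very low for sequential writes. So if the disks are the
limit, it is more likely because of seeks and waiting than raw bandwidth.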

Can you take a look in that direction? If it's not that, we will continue
to investigate ;)



2013/10/23 Harry Waye <hwaye@arachnys.com>

> I'm trying to load data into hbase using HFileOutputFormat and incremental
> bulk load but am getting rather lackluster performance, 10h for ~0.5TB
> data, ~50000 blocks.  This is being loaded into a table that has 2
> families, 9 columns, 2500 regions and is ~10TB in size.  Keys are md5
> hashes and regions are pretty evenly spread.  The majority of time appears
> to be spent in the reduce phase, with the map phase completing very
> quickly.  The network doesn't appear to be saturated, but the load is
> consistently at 6 which is the number of reduce tasks per node.
> 12 hosts (6 cores, 2 disk as RAID0, 1GB eth, no one else on the rack).
> MR conf: 6 mappers, 6 reducers per node.
> I spoke to someone on IRC and they recommended reducing job output
> replication to 1, and reducing the number of mappers which I reduced to 2.
>  Reducing replication appeared not to make any difference, reducing
> reducers appeared just to slow the job down.  I'm going to have a look at
> running the benchmarks mentioned on Michael Noll's blog and see what that
> turns up.  I guess some questions I have are:
> How does the global number/size of blocks affect perf.?  (I have a lot of
> 10mb files, which are the input files)
> How does the job local number/size of input blocks affect perf.?
> What is actually happening in the reduce phase that requires so much CPU?
>  I assume the actual construction of HFiles isn't intensive.
> Ultimately, how can I improve performance?
> Thanks
