hbase-user mailing list archives

From "Jonathan Gray" <jl...@streamy.com>
Subject RE: Import into empty table
Date Wed, 11 Mar 2009 18:50:55 GMT
Mat,

Do you have DataNodes hosted on the same machines as the RegionServers?

Is this import running as a MapReduce job?

You have 4 map and 4 reduce slots per node, plus the DataNode and the RS.
I'd recommend at the very least having 4 cores, or 8 if you have
CPU-intensive MR jobs.

Before memory becomes an issue, you're going to be CPU-bound quickly, with
the MR tasks, the DN, and the RS all contending for a single core
(hyperthreaded or not; even 2 cores may not be sufficient).

I have had some luck with splitting my tables early on in the import, but
this will only make a difference if you have fully randomized the insert
order of your keys, as Ryan pointed out.
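
For what it's worth, here is a rough sketch of doing the split up front at
table-creation time.  Big caveat: HBaseAdmin.createTable(desc, splitKeys)
is a client API from later HBase releases (around 0.90), not 0.19, where
the web UI split button you used is the practical route.  The table name,
family, and split points below are made up for illustration:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PreSplitTable {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);

      // Hypothetical table with one of the three column families.
      HTableDescriptor desc = new HTableDescriptor("import_table");
      desc.addFamily(new HColumnDescriptor("cf1"));

      // Split points must match the key distribution; with sequential
      // keys every write still lands in one region, so this only pays
      // off once the insert order of the keys is randomized.
      byte[][] splits = new byte[][] {
          Bytes.toBytes("2"), Bytes.toBytes("4"),
          Bytes.toBytes("6"), Bytes.toBytes("8"),
      };
      admin.createTable(desc, splits);
    }
  }

With keys spread evenly over those ranges, the initial load is distributed
across five regions instead of hammering a single one.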

Either way, you should probably have max map and reduce tasks set to 1 each
per node (see the snippet below).  Another idea, since you have a decent
number of nodes: you could segment your cluster a bit to prevent starvation
and contention between 4+ JVMs on a core, running HDFS separately from
HBase and MR.  I'd have to know more about what you're trying to do to help
you figure out the best distribution.
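
In hadoop-site.xml the one-task-per-node setting would look like this,
shown in the same property=value shorthand as your settings below (the
actual file uses <property> XML stanzas):

  mapred.tasktracker.map.tasks.maximum=1
  mapred.tasktracker.reduce.tasks.maximum=1

That caps each TaskTracker at one concurrent map and one concurrent reduce,
leaving the remaining CPU time for the DN and the RS.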

JG 

> -----Original Message-----
> From: Mat Hofschen [mailto:hofschen@gmail.com]
> Sent: Wednesday, March 11, 2009 1:15 AM
> To: hbase-user@hadoop.apache.org
> Subject: Import into empty table
> 
> Hi all,
> I am having trouble importing a medium-sized dataset into an empty new
> table. The import runs for about 60 minutes. There are a lot of
> failed/killed tasks in this scenario, and sometimes the import fails
> altogether.
> 
> If I import a smaller subset into the empty table, then perform a
> manual split of regions (via the split button on the web page), and
> then import the larger dataset, the import runs for about 10 minutes.
> 
> It seems to me that the performance bottleneck during the first import
> is the single region on the single cluster machine. This machine is
> heavily loaded. So my question is whether I can force HBase to split
> faster during heavy write operations, and what tuning parameters may
> affect this scenario.
> 
> Thanks for your help,
> Matthias
> 
> p.s. here are the details
> 
> Details:
> 33 cluster machines in the test lab (3-year-old servers with
> hyperthreaded single-core CPUs), 1.5 GB of memory, Debian 5 Lenny 32-bit
> hadoop 0.19.0, hbase 0.19.0
> -Xmx500m for Java processes
> hadoop
> mapred.map.tasks=20
> mapred.reduce.tasks=15
> dfs.block.size=16777216
> mapred.tasktracker.map.tasks.maximum=4
> mapred.tasktracker.reduce.tasks.maximum=4
> 
> hbase
> hbase.hregion.max.filesize=67108864 (64 MB)
> 
> hbase table
> 3 column families
> 
> import file
> 5 million records with 18 columns (6 columns per family)
> file size: 1.1 GB CSV file
> import via the provided Java SampleUploader

