hbase-user mailing list archives

From Something Something <mailinglist...@gmail.com>
Subject Re: Performance related question
Date Tue, 15 Dec 2009 17:14:08 GMT
Thanks J-D & Motohiko for the tips.  Significant improvement in performance,
but there's still room for more.  In my local pseudo-distributed mode the 2
MapReduce jobs now run in less than 4 minutes (down from 32 mins), and on a
cluster of 10 nodes + 5 ZK nodes they run in 11 minutes (down from 1 hour &
30 mins).  But I would still like to get to the point where they run faster
on the cluster than on my local machine.

Here's what I did:

1)  Fixed a bug in my code that was causing unnecessary writes to HBase.
2)  Added two lines right after creating the 'new HTable'.
3)  Added one line after each Put.
4)  Added one line, only when running on the cluster.  (All three changes
are sketched below.)
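
Roughly, the changes look like the following.  This is a sketch based on
the advice in the thread below; 'conf' is my HBaseConfiguration, 'rowKey'
a byte[], 'job' my mapreduce Job, and the buffer size and reducer count
are placeholder values I'm still tuning.

    // 2) Buffer Puts client-side instead of sending one RPC per Put.
    HTable table = new HTable(conf, "mytable");
    table.setAutoFlush(false);
    table.setWriteBufferSize(12 * 1024 * 1024);  // 12 MB; placeholder

    // 3) Skip the write-ahead log for each Put.  Faster, but these rows
    // are lost if a region server dies before its memstore is flushed.
    Put put = new Put(rowKey);
    put.setWriteToWAL(false);

    // 4) Only on the cluster: run more than the default single reducer.
    job.setNumReduceTasks(20);  // placeholder; see "How many Reduces?"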

There are other 64-bit related improvements which I cannot try, mainly
because Amazon charges (way) too much for 64-bit machines.  It cost me over
$25 for 15 machines for less than 3 hours, so I switched to 'm1.small'
32-bit machines.  Of course, one of the promises of distributed computing
is that we can use "cheap commodity hardware", right :)  So I would like to
stick with 'm1.small' for now.  (But I am willing to use about 30 machines
if that's going to help.)

Anyway, I have noticed that one of my Mappers is taking too long.  If anyone
could share ideas on how to improve Mapper speed, that would be greatly
appreciated.  Basically, in this Mapper I read about 50,000 rows from an
HBase table using TableMapReduceUtil.initTableMapperJob() and do some
complex processing on the "values" of each row.  I don't write anything back
to HBase, but I do write quite a few lines (context.write()) to HDFS.  Any
ideas?
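
For context, here is roughly how the mapper job is set up, including the
scanner-caching knob I haven't tuned yet.  "mytable" and MyMapper stand in
for my actual table and mapper class, and the caching value is a guess.

    // Default scanner caching in 0.20 is 1 row per RPC; fetching rows in
    // batches cuts the number of round trips during the map phase.
    Scan scan = new Scan();
    scan.setCaching(500);  // rows per next() RPC; tune for your row size

    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        MyMapper.class, Text.class, Text.class, job);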

Thanks once again for the help.

2009/12/13 <motohiko.mouri@justsystems.com>

> Hello,
> Something Something <mailinglists19@gmail.com> wrote:
> > PS:  One thing I have noticed is that it goes to 66% very fast and then
> > slows down from there..
> It seems that only one reducer is running, so you should increase the
> number of reduce tasks.  The default number of reduce tasks is documented
> in hadoop/docs/mapred-default.html: the default value of
> mapred.reduce.tasks is 1, so only one reduce task runs.
> There are two ways to increase the number of reduce tasks:
> 1. Call Job.setNumReduceTasks(int tasks) in your MapReduce job.
> 2. Set a higher mapred.reduce.tasks in hadoop/conf/mapred-site.xml.
> You can get the best performance if you run 20 reduce tasks.  The details
> on choosing the number of reduce tasks are at
> http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Reducer
> under "How many Reduces?", as J-D wrote.  Note that
> JobConf.setNumReduceTasks(int) is already deprecated, so you should use
> Job.setNumReduceTasks(int tasks) rather than JobConf.setNumReduceTasks(int).
> --
> Motohiko Mouri
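
For reference, option 2 from Motohiko's mail would go in
hadoop/conf/mapred-site.xml like this (the value 20 is just the example
from his mail):

    <property>
      <name>mapred.reduce.tasks</name>
      <value>20</value>
    </property>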
