hadoop-hdfs-user mailing list archives

From Pavan Sudheendra <pavan0...@gmail.com>
Subject How to best decide mapper output/reducer input for a huge string?
Date Sat, 21 Sep 2013 06:32:55 GMT
I need to improve my MR job, which uses HBase as both source and sink.

Basically, I'm reading data from 3 HBase tables in the mapper and writing it
out as one huge string for the reducer, which does some computation and dumps
the result into an HBase table.

Table1 ~ 19 million rows.
Table2 ~ 2 million rows.
Table3 ~ 900,000 rows.

The output of the mapper is something like this:

HouseHoldId contentID name duration genre type channelId personId
televisionID timestamp

I'm interested in sorting on the basis of the HouseHoldId value, so I'm using
it as the mapper output key. I'm not interested in the V part of the pair, so
I'm more or less ignoring it. My mapper class is defined as follows:

public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }
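Inside map() I essentially pull the HouseHoldId out of the delimited record and emit it as the Text key. Stripped of the Hadoop/HBase machinery, the key extraction itself is just the following (a plain-Java sketch; the field order is assumed to match the layout above, and whitespace is assumed to be the delimiter):

```java
// Minimal sketch (plain Java, outside Hadoop): pull the sort key out of
// one mapper output record. Assumes the record is whitespace-delimited
// and HouseHoldId is the first field, as in the layout above.
public class KeyExtractor {
    // Returns the HouseHoldId, i.e. the first whitespace-separated field.
    public static String houseHoldId(String record) {
        return record.trim().split("\\s+")[0];
    }

    public static void main(String[] args) {
        // Hypothetical record values, for illustration only.
        String record = "HH42 c1 name 3600 drama live ch7 p9 tv1 1379745175";
        System.out.println(houseHoldId(record));
    }
}
```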

The MR job takes 22 hours to complete, which is not desirable at all. I'm
supposed to optimize it somehow to run a lot faster.

scan.setCacheBlocks(false);
TableMapReduceUtil.initTableMapperJob(
        Table1,                      // input HBase table name
        scan,                        // Scan instance
        AnalyzeMapper.class,         // mapper class
        Text.class,                  // mapper output key
        IntWritable.class,           // mapper output value
        job);
TableMapReduceUtil.initTableReducerJob(
        OutputTable,                 // output table
        AnalyzeReducerTable.class,   // reducer class
        job);
My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running an
8-node Cloudera cluster.

Should I use a custom SortComparator or a GroupComparator?
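To make the question concrete: the grouping behaviour I'm after, sketched in plain Java rather than as a real Hadoop RawComparator (the real thing would extend WritableComparator and be registered via Job.setGroupingComparatorClass), would compare records on the HouseHoldId field only, so that all records of one household reach a single reduce() call:

```java
import java.util.Comparator;

// Plain-Java sketch of grouping-comparator behaviour (not the Hadoop API):
// compare composite record strings on the HouseHoldId field alone.
// Assumes whitespace-delimited records with HouseHoldId first, as above.
public class HouseHoldGrouping {
    public static final Comparator<String> BY_HOUSEHOLD =
            Comparator.comparing((String r) -> r.trim().split("\\s+")[0]);

    public static void main(String[] args) {
        // 0 means "same group"; non-zero means "different groups".
        System.out.println(BY_HOUSEHOLD.compare("HH1 a b", "HH1 c d"));
        System.out.println(BY_HOUSEHOLD.compare("HH1 a b", "HH2 a b"));
    }
}
```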
