hadoop-user mailing list archives

From Pradeep Gollakota <pradeep...@gmail.com>
Subject Re: How to best decide mapper output/reducer input for a huge string?
Date Sat, 21 Sep 2013 06:56:41 GMT
I'm sorry, but I don't understand your question. Is the output you're
describing the key portion of the mapper's output? If it is the key, then
your data should already be sorted by HouseHoldId, since it occurs first in
your key.

The SortComparator tells Hadoop how to sort your data, so you use it when
you need a non-lexical sort order. The GroupingComparator tells Hadoop how
to group your data for the reducer: all KV-pairs in the same group are
passed to a single call of the reduce() method.
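To illustrate the sort side in plain Java (a real job would subclass
WritableComparator, which compares serialized bytes, and register it with
job.setSortComparatorClass(); the tab-separated key layout here is just an
assumption based on your mapper output):

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortSketch {
    // Compare composite keys numerically on the leading HouseHoldId field
    // instead of lexically on the whole string.
    static final Comparator<String> BY_HOUSEHOLD_ID =
            Comparator.comparingLong(
                    (String key) -> Long.parseLong(key.split("\t", 2)[0]));

    public static void main(String[] args) {
        String[] keys = {"100\tcontentA", "9\tcontentB", "25\tcontentC"};

        Arrays.sort(keys, BY_HOUSEHOLD_ID); // numeric order: 9, 25, 100
        System.out.println(Arrays.toString(keys));

        Arrays.sort(keys);                  // lexical order: "100", "25", "9"
        System.out.println(Arrays.toString(keys));
    }
}
```

With Hadoop's default lexical sort on Text, household 100 would sort before
household 9, which is usually not what you want for numeric IDs.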

If your reduce computation needs all the KV-pairs for the same HouseHoldId,
then you will need to write a GroupingComparator.
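As a plain-Java sketch of the grouping idea (again assuming a tab-separated
key with HouseHoldId first; in a real job this would be a WritableComparator
registered via job.setGroupingComparatorClass()):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupingSketch {
    // Two keys belong to the same reduce group iff their HouseHoldId
    // (the first tab-separated field) is equal; the rest of the key is
    // ignored for grouping purposes.
    static final Comparator<String> BY_GROUP =
            Comparator.comparing((String key) -> key.split("\t", 2)[0]);

    public static void main(String[] args) {
        // Simulate the shuffle: records whose HouseHoldId matches collapse
        // into one group, regardless of the rest of the key.
        Map<String, List<String>> groups = new TreeMap<>(BY_GROUP);
        for (String key : new String[]{"42\tmovieA", "42\tmovieB", "7\tnews"}) {
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(key);
        }
        System.out.println(groups.size() + " reduce groups"); // 2 groups
    }
}
```

Without a GroupingComparator, grouping falls back to full-key equality, so
each distinct composite key would get its own reduce() call instead of one
call per household.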

Also, have you considered using a higher-level abstraction on top of Hadoop,
such as Pig, Hive, Cascading, etc.? Sorting/grouping tasks are a LOT easier
to write in those languages.

Hope this helps!
- Pradeep

On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <pavan0591@gmail.com> wrote:

> I need to improve my MR jobs, which use HBase as both source and sink.
> Basically, I'm reading data from 3 HBase tables in the mapper, writing
> it out as one huge string for the reducer to do some computation and dump
> into an HBase table.
> Table1 ~ 19 million rows. Table2 ~ 2 million rows. Table3 ~ 900,000 rows.
> The output of the mapper is something like this :
> HouseHoldId contentID name duration genre type channelId personId televisionID timestamp
> I'm interested in sorting it on the basis of the HouseHoldId value, so I'm
> using this technique. I'm not interested in the V part of the pair, so I'm
> more or less ignoring it. My mapper class is defined as follows:
> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }
> My MR job takes 22 hours to complete, which is not desirable at all. I'm
> supposed to optimize it somehow to run a lot faster.
> scan.setCaching(750);
> scan.setCacheBlocks(false);
> TableMapReduceUtil.initTableMapperJob(
>         Table1,               // input HBase table name
>         scan,
>         AnalyzeMapper.class,  // mapper
>         Text.class,           // mapper output key
>         IntWritable.class,    // mapper output value
>         job);
> TableMapReduceUtil.initTableReducerJob(
>         OutputTable,                // output table
>         AnalyzeReducerTable.class,  // reducer class
>         job);
> job.setNumReduceTasks(RegionCount);
> My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running
> an 8-node Cloudera cluster.
> Should i use a custom SortComparator or a Group Comparator?
> --
> Regards-
> Pavan
