hadoop-user mailing list archives

From Pavan Sudheendra <pavan0...@gmail.com>
Subject Re: How to best decide mapper output/reducer input for a huge string?
Date Sat, 21 Sep 2013 07:04:08 GMT
Hi Pradeep,
Yes, basically I'm only writing the key part as the map output; the V of
<K,V> is not of much use to me, though I'm open to changing that if it leads
to faster execution. I'm something of a newbie, so I'm looking to make the
map/reduce job run a lot faster.

Also, yes, it gets sorted by the HouseHoldID, which is what I needed. But
if I write a map output for each and every row of a 19-million-row HBase
table, it takes nearly a day to complete (21 mappers and 21 reducers).

I have looked at both Pig and Hive for this, but I'm required to do it
via an MR job, so I cannot use either of them. Is there anything you would
recommend I try, given the data in that format?
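Since the value side of the pair is unused, one way to shrink what gets sorted and shuffled is to key on HouseHoldID alone and carry the rest of the record as the value instead of emitting one huge string as the key. A minimal sketch of that key/value split in plain Java (the class name, single-space delimiter, and field order are illustrative assumptions based on the record layout quoted in the original message):

```java
/**
 * Splits a flat record "HouseHoldId contentID name duration ..." into a
 * (key, value) pair so that only the HouseHoldId is shuffled as the key.
 * The single-space delimiter is an assumption, not confirmed by the thread.
 */
public class HouseHoldKeySplit {

    public static String[] toKeyValue(String record) {
        int sp = record.indexOf(' ');
        if (sp < 0) {
            return new String[] { record, "" }; // no extra fields: empty value
        }
        return new String[] { record.substring(0, sp), record.substring(sp + 1) };
    }

    public static void main(String[] args) {
        String[] kv = toKeyValue("HH42 c7 someName 120 drama tv ch3 p1 tv9 1379746800");
        System.out.println(kv[0]); // prints "HH42"
        System.out.println(kv[1]); // prints the remaining fields
    }
}
```

In a TableMapper this would correspond to something like `context.write(new Text(houseHoldId), new Text(rest))`, so Hadoop sorts and groups on the short HouseHoldID key rather than the whole concatenated row.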

On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <pradeepg26@gmail.com> wrote:

> I'm sorry, but I don't understand your question. Is the output of the
> mapper you're describing the key portion? If it is the key, then your data
> should already be sorted by HouseHoldId, since it occurs first in your key.
> The SortComparator tells Hadoop how to sort your data, so you use it when
> you need a non-lexical sort order. The GroupingComparator tells Hadoop how
> to group your data for the reducer: all KV-pairs from the same group will
> be given to the same Reducer.
> If your reduce computation needs all the KV-pairs for the same
> HouseHoldId, then you will need to write a GroupingComparator.
> Also, have you considered using a higher-level abstraction on Hadoop such
> as Pig, Hive, Cascading, etc.? Sorting/grouping tasks of this type are a
> LOT easier to write in these languages.
> Hope this helps!
> - Pradeep
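The GroupingComparator Pradeep describes boils down to comparing only the HouseHoldId prefix of the composite key. A minimal sketch of that comparison in plain Java (the class name, delimiter, and key layout are illustrative assumptions; a real Hadoop implementation would put this logic in a WritableComparator registered via job.setGroupingComparatorClass(...)):

```java
import java.util.Comparator;

/**
 * Compares composite keys like "HouseHoldId contentID ..." by their first
 * field only, so all records for one household land in the same reduce
 * group. The single-space delimiter is an assumption from the record
 * layout described in this thread.
 */
public class HouseHoldGroupingLogic implements Comparator<String> {

    private static String houseHoldId(String key) {
        int sp = key.indexOf(' ');
        return sp < 0 ? key : key.substring(0, sp);
    }

    @Override
    public int compare(String a, String b) {
        return houseHoldId(a).compareTo(houseHoldId(b));
    }

    public static void main(String[] args) {
        Comparator<String> g = new HouseHoldGroupingLogic();
        System.out.println(g.compare("HH1 c9", "HH1 c2") == 0); // prints "true"
        System.out.println(g.compare("HH1 c9", "HH2 c1") < 0);  // prints "true"
    }
}
```

Two keys with the same HouseHoldId compare as equal for grouping even though their trailing fields differ, which is exactly what sends all of a household's KV-pairs to one reduce() call.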
> On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <pavan0591@gmail.com> wrote:
>> I need to improve my MR jobs, which use HBase as both source and sink.
>> Basically, I'm reading data from 3 HBase tables in the mapper, writing it
>> out as one huge string for the reducer to do some computation on and dump
>> into an HBase table.
>> Table1 ~ 19 million rows. Table2 ~ 2 million rows. Table3 ~ 900,000 rows.
>> The output of the mapper is something like this:
>> HouseHoldId contentID name duration genre type channelId personId televisionID timestamp
>> I'm interested in sorting it on the basis of the HouseHoldID value, so
>> I'm using this technique. I'm not interested in the V part of the pair,
>> so I'm more or less ignoring it. My mapper class is defined as follows:
>> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> {
>> My MR job takes 22 hours to complete, which is not desirable at all. I'm
>> supposed to optimize it somehow to run a lot faster.
>> scan.setCaching(750);
>> scan.setCacheBlocks(false);
>> TableMapReduceUtil.initTableMapperJob(
>>         Table1,                  // input HBase table name
>>         scan,
>>         AnalyzeMapper.class,     // mapper
>>         Text.class,              // mapper output key
>>         IntWritable.class,       // mapper output value
>>         job);
>> TableMapReduceUtil.initTableReducerJob(
>>         OutputTable,                // output table
>>         AnalyzeReducerTable.class,  // reducer class
>>         job);
>> job.setNumReduceTasks(RegionCount);
>> My HBase Table1 has 21 regions, so 21 mappers are spawned. We are
>> running an 8-node Cloudera cluster.
>> Should I use a custom SortComparator or a GroupingComparator?
>> --
>> Regards-
>> Pavan

