hadoop-hdfs-user mailing list archives

From Pavan Sudheendra <pavan0...@gmail.com>
Subject Re: How to best decide mapper output/reducer input for a huge string?
Date Sat, 21 Sep 2013 08:17:26 GMT
No, I don't have a combiner in place. Is it necessary? How do I make my map
output compressed? Yes, the tables in HBase are compressed.

Although there's no real bottleneck, the time it takes to process the
entire table is huge. I have to constantly check whether I can optimize it.

Oh okay, I'll implement a custom Writable. Apart from that, do you see
anything wrong with my design? Does it require any kind of rework? Thank
you so much for helping.
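
For reference, a minimal sketch of such a composite key, assuming HouseHoldId is an int and the timestamp a long; the class and field names here are illustrative, not taken from the actual job:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Composite key: sorts by householdId first, then timestamp.
    public class HouseholdKey implements WritableComparable<HouseholdKey> {
        private int householdId;
        private long timestamp;

        public HouseholdKey() {}   // Hadoop requires a no-arg constructor

        public HouseholdKey(int householdId, long timestamp) {
            this.householdId = householdId;
            this.timestamp = timestamp;
        }

        public int getHouseholdId() {
            return householdId;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(householdId);   // compact binary serialization,
            out.writeLong(timestamp);    // much cheaper than parsing Text
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            householdId = in.readInt();
            timestamp = in.readLong();
        }

        @Override
        public int compareTo(HouseholdKey other) {
            int cmp = Integer.compare(householdId, other.householdId);
            return (cmp != 0) ? cmp : Long.compare(timestamp, other.timestamp);
        }

        @Override
        public int hashCode() {
            return householdId;   // the default HashPartitioner then sends all
        }                         // records for one household to one reducer

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof HouseholdKey)) return false;
            HouseholdKey k = (HouseholdKey) o;
            return k.householdId == householdId && k.timestamp == timestamp;
        }
    }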

On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <pradeepg26@gmail.com> wrote:

> One thing that comes to mind is that your keys are Strings, which are
> highly inefficient. You might get a lot better performance if you write a
> custom Writable for your key object using the appropriate data types. For
> example, use a long (LongWritable) for timestamps. This should make
> (de)serialization a lot faster. If HouseHoldId is an integer, your
> comparisons for sorting will also be faster.
> Ensure that your map outputs are being compressed. Are your tables in
> HBase compressed? Do you have a combiner?
> Have you been able to profile your code to see where the bottlenecks are?
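
A minimal sketch of turning on map-output compression in the driver (Snappy shown here; this assumes the native Snappy libraries are installed on the cluster — LZO or gzip work the same way). The property names below are the Hadoop 1.x ones; on Hadoop 2 they are mapreduce.map.output.compress and mapreduce.map.output.compress.codec:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;

    Configuration conf = job.getConfiguration();

    // Compress intermediate map output before it is spilled and shuffled,
    // trading a little CPU for much less disk and network I/O.
    conf.setBoolean("mapred.compress.map.output", true);
    conf.setClass("mapred.map.output.compression.codec",
                  SnappyCodec.class, CompressionCodec.class);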
> On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <pavan0591@gmail.com> wrote:
>> Hi Pradeep,
>> Yes. Basically I'm only writing the key part as the map output; the V
>> of <K,V> is not of much use to me, but I'm hoping to change that if it
>> leads to faster execution. I'm kind of a newbie, so I'm looking to make
>> the map/reduce job run a lot faster.
>> Also, yes, it gets sorted by the HouseHoldID, which is what I needed. But
>> it seems that if I write a map output for each and every row of a 19 million
>> row HBase table, it takes nearly a day to complete (21 mappers and 21 reducers).
>> I have looked at both Pig and Hive to do the job, but I'm supposed to do
>> this via an MR job, so I cannot use either of them. Do you recommend
>> anything else to try, given the data in that format?
>> On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <pradeepg26@gmail.com> wrote:
>>> I'm sorry, but I don't understand your question. Is the output of the
>>> mapper you're describing the key portion? If it is the key, then your data
>>> should already be sorted by HouseHoldId, since it occurs first in your key.
>>> The SortComparator tells Hadoop how to sort your data, so you use it
>>> if you need a non-lexical sort order. The GroupingComparator tells Hadoop
>>> how to group your data for the reducer; all KV-pairs in the same group
>>> are passed to a single reduce() call.
>>> If your reduce computation needs all the KV-pairs for the same
>>> HouseHoldId, then you will need to write a GroupingComparator.
>>> Also, have you considered using a higher-level abstraction on Hadoop,
>>> such as Pig, Hive, or Cascading? Sorting/grouping tasks are a LOT
>>> easier to express in those languages.
>>> Hope this helps!
>>> - Pradeep
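
A minimal sketch of such a GroupingComparator, assuming a composite key like the HouseholdKey sketched near the top of this page (householdId first, timestamp second):

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Groups all keys with the same householdId into a single reduce()
    // call, ignoring the timestamp part of the composite key.
    public class HouseholdGroupingComparator extends WritableComparator {

        protected HouseholdGroupingComparator() {
            super(HouseholdKey.class, true);   // true => create key instances
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            int idA = ((HouseholdKey) a).getHouseholdId();
            int idB = ((HouseholdKey) b).getHouseholdId();
            return Integer.compare(idA, idB);
        }
    }

It would be wired into the job with job.setGroupingComparatorClass(HouseholdGroupingComparator.class).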
>>> On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <pavan0591@gmail.com> wrote:
>>>> I need to improve my MR job, which uses HBase as both source and
>>>> sink.
>>>> Basically, I'm reading data from 3 HBase tables in the mapper, writing
>>>> it out as one huge string for the reducer to do some computation on and
>>>> dump into an HBase table.
>>>> Table1 ~ 19 million rows. Table2 ~ 2 million rows. Table3 ~ 900,000 rows.
>>>> The output of the mapper is something like this:
>>>> HouseHoldId contentID name duration genre type channelId personId televisionID
>>>> I'm interested in sorting on the basis of the HouseHoldID value, so
>>>> I'm using this technique. I'm not interested in the V part of the pair,
>>>> so I'm kind of ignoring it. My mapper class is defined as follows:
>>>> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }
>>>> The MR job takes 22 hours to complete, which is not desirable at all.
>>>> I'm supposed to optimize it to run a lot faster somehow.
>>>> scan.setCaching(750);
>>>> scan.setCacheBlocks(false);
>>>> TableMapReduceUtil.initTableMapperJob(
>>>>         Table1,               // input HBase table
>>>>         scan,
>>>>         AnalyzeMapper.class,  // mapper
>>>>         Text.class,           // mapper output key
>>>>         IntWritable.class,    // mapper output value
>>>>         job);
>>>> TableMapReduceUtil.initTableReducerJob(
>>>>         OutputTable,                // output HBase table
>>>>         AnalyzeReducerTable.class,  // reducer
>>>>         job);
>>>> job.setNumReduceTasks(RegionCount);
>>>> My HBase Table1 has 21 regions, so 21 mappers are spawned. We are
>>>> running an 8-node Cloudera cluster.
>>>> Should I use a custom SortComparator or a GroupingComparator?
>>>> --
>>>> Regards-
>>>> Pavan
>> --
>> Regards-
>> Pavan

