hadoop-hdfs-user mailing list archives

From Adeel Qureshi <adeelmahm...@gmail.com>
Subject Re: secondary sort - number of reducers
Date Fri, 30 Aug 2013 15:02:05 GMT
My secondary sort on multiple keys seems to work fine with smaller data sets,
but with bigger data sets (like 256 GB and 800M+ records) the mapper phase
gets done pretty quickly (about 15 mins) and then the reducer phase seems to
take forever. I am using 255 reducers.

The basic idea is that my composite key has both group and sort keys in it,
which I parse in the appropriate comparator classes to perform grouping and
sorting. My thinking is that the mappers are where most of the work is done:
1. the mapper itself (create composite key and value)
2. records sorting
3. partitioner

If all this gets done in 15 mins, then the reducer has the simple task of
1. grouping comparator
2. the reducer itself (simply output records)

and should take less time than the mappers. Instead it essentially gets stuck
in the reduce phase. I'm going to paste my code here to see if anything
stands out as a fundamental design issue.
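
To give the full picture, here is roughly what the map side does (a sketch,
not the real code; the column positions and field names are assumptions based
on the pipe-delimited key format described above):

//////MAPPER (illustrative sketch)
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hcatalog.data.HCatRecord;

public class HSMapper extends Mapper<WritableComparable, HCatRecord, Text, HCatRecord> {

    private final Text compositeKey = new Text();

    @Override
    protected void map(WritableComparable key, HCatRecord value, Context context)
            throws IOException, InterruptedException {
        //build "groupKey|sortKey", e.g. "Atlanta:GA|Atlanta:GA:1:Adeel";
        //the column positions used here are placeholders
        String groupKey = value.get(0) + ":" + value.get(1);
        String sortKey = groupKey + ":" + value.get(2) + ":" + value.get(3);
        compositeKey.set(groupKey + "|" + sortKey);
        context.write(compositeKey, value);
    }
}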

//////PARTITIONER
@Override
public int getPartition(Text key, HCatRecord record, int numReduceTasks) {
    //extract the group key from the composite key
    String groupKey = key.toString().split("\\|")[0];
    return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
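
(The & Integer.MAX_VALUE mask clears the sign bit, so a negative hashCode can
never produce a negative partition number; that is what caused the Illegal
partition error quoted further down in the thread.)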


////////////GROUP COMPARATOR
@Override
public int compare(WritableComparable a, WritableComparable b) {
    //compare two Text objects on the group key portion only
    String thisGroupKey = ((Text) a).toString().split("\\|")[0];
    String otherGroupKey = ((Text) b).toString().split("\\|")[0];
    return thisGroupKey.compareTo(otherGroupKey);
}


////////////SORT COMPARATOR
The sort comparator is similar to the group comparator; it runs during the
map-side sort, which gets done quickly.
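A rough sketch of what it looks like (illustrative, not the exact code; the
real class compares the full composite key, group keys first, then sort keys):

@Override
public int compare(WritableComparable a, WritableComparable b) {
    //order by the full composite key: group keys first, then sort keys,
    //so each reducer group arrives internally sorted
    String[] thisKey = ((Text) a).toString().split("\\|");
    String[] otherKey = ((Text) b).toString().split("\\|");
    int cmp = thisKey[0].compareTo(otherKey[0]);
    if (cmp != 0) {
        return cmp;
    }
    return thisKey[1].compareTo(otherKey[1]);
}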



//////////REDUCER
public void reduce(Text key, Iterable<HCatRecord> records, Context context)
        throws IOException, InterruptedException {
    log.info("in reducer for key " + key.toString());
    Iterator<HCatRecord> recordsIter = records.iterator();
    //we are only interested in the first record after sorting and grouping
    if (recordsIter.hasNext()) {
        HCatRecord rec = recordsIter.next();
        //nw is presumably a NullWritable (or similar) output key defined elsewhere in the class
        context.write(nw, rec);
        log.info("returned record >> " + rec.toString());
    }
}
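
For completeness, the job wiring looks roughly like this (a sketch; every
class name except HSMapper is a placeholder, and this assumes the new
mapreduce API inside a Tool's run() method):

//////////DRIVER (illustrative sketch)
Job job = new Job(getConf(), "hivesort");
job.setJarByClass(HSMapper.class);
job.setMapperClass(HSMapper.class);
job.setReducerClass(HSReducer.class);                    //placeholder name
job.setPartitionerClass(HSPartitioner.class);            //placeholder name
job.setSortComparatorClass(HSSortComparator.class);      //placeholder name
job.setGroupingComparatorClass(HSGroupComparator.class); //placeholder name
job.setNumReduceTasks(255); //without this, Hadoop falls back to the default of 1 reducer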


On Fri, Aug 30, 2013 at 9:24 AM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:

> Yup, it was negative, and by doing this it now seems to be working fine.
>
>
> On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma <shekhar2581@gmail.com> wrote:
>
>> Is the hash code of that key negative?
>> Do something like this:
>>
>> return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;
>>
>> Regards,
>> Som Shekhar Sharma
>> +91-8197243810
>>
>>
>> On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <adeelmahmood@gmail.com>
>> wrote:
>> > Okay, so when I specify the number of reducers, e.g. in my example I'm
>> > using 4 (for a much smaller data set), it works if I use a single column
>> > in my composite key .. but if I add multiple columns in the composite
>> > key, separated by a delimiter, it then throws the illegal partition
>> > error (keys before the pipe are group keys and after the pipe are the
>> > sort keys, and my partitioner only uses the group keys):
>> >
>> > java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel (-1)
>> >         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
>> >         at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
>> >         at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>> >         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
>> >         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
>> >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>> >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>> >         at java.security.AccessController.doPrivileged(Native Method)
>> >         at javax.security.auth.Subject.doAs(Subject.java:396)
>> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>> >         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>> >
>> >
>> > public int getPartition(Text key, HCatRecord record, int numParts) {
>> >     //extract the group key from the composite key
>> >     String groupKey = key.toString().split("\\|")[0];
>> >     return groupKey.hashCode() % numParts;
>> > }
>> >
>> >
>> > On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <shekhar2581@gmail.com>
>> > wrote:
>> >>
>> >> No... the partitioner decides which keys should go to which reducer;
>> >> the number of reducers is something you have to decide yourself. The
>> >> number of reducers depends on factors like the number of key-value
>> >> pairs, the use case, etc.
>> >> Regards,
>> >> Som Shekhar Sharma
>> >> +91-8197243810
>> >>
>> >>
>> >> > On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <adeelmahmood@gmail.com>
>> >> > wrote:
>> >> > So it can't figure out an appropriate number of reducers as it does
>> >> > for mappers? In my case Hadoop is using 2100+ mappers and then only
>> >> > 1 reducer. Since I'm overriding the partitioner class, shouldn't
>> >> > that decide how many reducers there should be, based on how many
>> >> > different partition values are being returned by the custom
>> >> > partitioner?
>> >> >
>> >> >
>> >> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ian@cloudera.com>
>> >> > wrote:
>> >> >>
>> >> >> If you don't specify the number of Reducers, Hadoop will use the
>> >> >> default
>> >> >> -- which, unless you've changed it, is 1.
>> >> >>
>> >> >> Regards
>> >> >>
>> >> >> Ian.
>> >> >>
>> >> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <adeelmahmood@gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >> I have implemented secondary sort in my MR job, and for some reason
>> >> >> if I don't specify the number of reducers it uses 1, which doesn't
>> >> >> seem right because I'm working with 800M+ records, and one reducer
>> >> >> slows things down significantly. Is this some kind of limitation of
>> >> >> secondary sort, that it has to use a single reducer? That would
>> >> >> kind of defeat the purpose of having a scalable solution such as
>> >> >> secondary sort. I would appreciate any help.
>> >> >>
>> >> >> Thanks
>> >> >> Adeel
>> >> >>
>> >> >>
>> >> >>
>> >> >> ---
>> >> >> Ian Wrigley
>> >> >> Sr. Curriculum Manager
>> >> >> Cloudera, Inc
>> >> >> Cell: (323) 819 4075
>> >> >>
>> >> >
>> >
>> >
>>
>
>
