hadoop-mapreduce-user mailing list archives

From java8964 java8964 <java8...@hotmail.com>
Subject RE: secondary sort - number of reducers
Date Fri, 30 Aug 2013 15:58:04 GMT
Well, the reduce stage will normally take much longer than the map stage, because the copy/shuffle/sort
all happen during it, and those are the hard part.
But before we simply say it is a fact of life, you need to dig further into your MR job to
find out whether you can make it faster.
You are the person most familiar with your data, and you wrote the code that groups/partitions
it and sends it to the reducers. Even if you set up 255 reducers, the question is: does each
of them get its fair share? You need to read the COUNTER information for each reducer and find
out how many reduce groups each reducer gets, how many input bytes it gets, etc.
A simple example: if you send 200G of data and group it by DATE, and all the data belongs to
2 days, one of which contains 90% of the data, then giving it 255 reducers won't help.
Only 2 reducers will consume data, and one of them will consume 90% of it and take a very
long time to finish, which WILL delay the whole MR job, while the rest of the reducers
finish within seconds. In this case, maybe you need to rethink what your key should be,
and make sure each reducer gets its fair share of the data volume.
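
To make the skew example concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that runs a list of group keys through the same mask-and-modulo rule used by the partitioner in this thread and counts how many records each reducer would receive. The class name `PartitionSkew`, its method names and the sample dates are made up for illustration; on a real job the per-reducer counters give you the same information.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSkew {
    // Same rule as the partitioner in this thread: clear the sign bit, then take the modulo.
    static int partitionFor(String groupKey, int numReduceTasks) {
        return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts how many records would land on each reducer.
    static long[] recordsPerReducer(List<String> groupKeys, int numReduceTasks) {
        long[] counts = new long[numReduceTasks];
        for (String key : groupKeys) {
            counts[partitionFor(key, numReduceTasks)]++;
        }
        return counts;
    }

    // Number of reducers that receive at least one record.
    static long busyReducers(long[] counts) {
        long busy = 0;
        for (long c : counts) {
            if (c > 0) busy++;
        }
        return busy;
    }

    // Largest number of records assigned to a single reducer.
    static long maxLoad(long[] counts) {
        long max = 0;
        for (long c : counts) {
            max = Math.max(max, c);
        }
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical input: 1000 records but only two distinct dates, one holding 90% of them.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 900; i++) keys.add("2013-08-29");
        for (int i = 0; i < 100; i++) keys.add("2013-08-30");

        long[] counts = recordsPerReducer(keys, 255);
        System.out.println("reducers with data: " + busyReducers(counts) + " of " + counts.length);
        System.out.println("largest reducer load: " + maxLoad(counts));
    }
}
```

With only two distinct group keys, at most two of the 255 slots can ever be non-empty, no matter how the hash falls.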
After the above fix (in fact, it will normally fix 90% of reducer performance problems, especially
since you have 255 reducer tasks available, so each one will get only about 1G of data on average,
which is good for your huge cluster that only needs to process 256G of data :-), if you want to make
it even faster, then check your code. Do you have to use String.compareTo()? Is it slow? Google
"hadoop rawcomparator" to see if you can do something here.
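
As a rough illustration of the raw-comparator idea, the sketch below compares the group part of two composite keys directly on their UTF-8 bytes, without building a String or calling split() for every comparison. It is plain Java, and the class and method names are hypothetical. In Hadoop the same loop would sit inside `RawComparator.compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)`, and for serialized `Text` keys the variable-length length prefix at the start of each range would have to be skipped first (for example with `WritableUtils.decodeVIntSize`); that part is not shown or tested here. Unsigned byte order also need not match `String.compareTo()` order for every non-ASCII input, although equal group keys still compare as equal.

```java
import java.nio.charset.StandardCharsets;

public class GroupPrefixCompare {
    // Number of bytes before the first '|' in the range, or the whole range if there is none.
    static int prefixLength(byte[] b, int start, int len) {
        for (int i = 0; i < len; i++) {
            if (b[start + i] == '|') return i;
        }
        return len;
    }

    // Compares the bytes before the first '|' of each range as unsigned values.
    static int compareGroupPrefix(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n1 = prefixLength(b1, s1, l1);
        int n2 = prefixLength(b2, s2, l2);
        int n = Math.min(n1, n2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) return x - y;
        }
        // All shared bytes are equal: the shorter prefix sorts first.
        return n1 - n2;
    }

    public static void main(String[] args) {
        // The first key is taken from the stack trace in this thread; the others are made up.
        byte[] a = "Atlanta:GA|Atlanta:GA:1:Adeel".getBytes(StandardCharsets.UTF_8);
        byte[] b = "Atlanta:GA|Atlanta:GA:2:Someone".getBytes(StandardCharsets.UTF_8);
        byte[] c = "Boston:MA|Boston:MA:1:Someone".getBytes(StandardCharsets.UTF_8);

        System.out.println("a vs b: " + compareGroupPrefix(a, 0, a.length, b, 0, b.length));
        System.out.println("a vs c: " + compareGroupPrefix(a, 0, a.length, c, 0, c.length));
    }
}
```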
After that, if you still think the reducer stage is slow, check your cluster. Does the
reducer spend most of its time in the copy stage, in the sort, or in your reducer class? Find out
where the time is spent, then identify the solution.

Date: Fri, 30 Aug 2013 11:02:05 -0400
Subject: Re: secondary sort - number of reducers
From: adeelmahmood@gmail.com
To: user@hadoop.apache.org

my secondary sort on multiple keys seems to work fine with smaller data sets but with bigger
data sets (like 256 gig and 800M+ records) the mapper phase gets done pretty quick (about
15 mins) but then the reducer phase seems to take forever. I am using 255 reducers.

basic idea is that my composite key has both group and sort keys in it which i parse in the
appropriate comparator classes to perform grouping and sorting .. my thinking is that mappers
is where most of the work is done
1. mapper itself (create composite key and value)
2. records sorting
3. partitioner
if all this gets done in 15 mins then reducer has the simple task of
1. grouping comparator
2. reducer itself (simply output records)
should take less time than mappers .. instead it essentially gets stuck in reduce phase ..
im gonna paste my code here to see if anything stands out as a fundamental design issue

//////PARTITIONER
public int getPartition(Text key, HCatRecord record, int numReduceTasks) {
    //extract the group key from composite key
    String groupKey = key.toString().split("\\|")[0];
    return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

////////////GROUP COMPARATOR
public int compare(WritableComparable a, WritableComparable b) {
    //compare two text objects
    String thisGroupKey = ((Text) a).toString().split("\\|")[0];
    String otherGroupKey = ((Text)
    //extract
    return thisGroupKey.compareTo(otherGroupKey);
}

////////////SORT COMPARATOR is similar to group comparator and is in map phase and gets done

public void reduce(Text key, Iterable<HCatRecord> records, Context context) throws IOException, InterruptedException {
    log.info("in reducer for key " + key.toString());
    Iterator<HCatRecord> recordsIter = records.iterator();
    //we are only interested in the first record after sorting and grouping
    if(recordsIter.hasNext()){
        HCatRecord rec = recordsIter.next();
        context.write(nw, rec);
        log.info("returned record >> " + rec.toString());
    }
}
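
The overall pattern in this message (composite key, sort on the full key, group on the part before the pipe, keep only the first record of each group) can be tried out without a cluster. The following is a minimal single-JVM sketch of that flow, not the Hadoop job itself: the class name `SecondarySortSketch` and all sample keys except the Atlanta one are made up, and an in-memory sort stands in for the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SecondarySortSketch {
    // Group key is everything before the first pipe, as in the partitioner and group comparator.
    static String groupKey(String compositeKey) {
        return compositeKey.split("\\|")[0];
    }

    // Sorts by group key, then by the full composite key, and keeps the first key of each group.
    static List<String> firstPerGroup(List<String> compositeKeys) {
        List<String> sorted = new ArrayList<>(compositeKeys);
        // Stands in for the sort comparator.
        sorted.sort((a, b) -> {
            int c = groupKey(a).compareTo(groupKey(b));
            return c != 0 ? c : a.compareTo(b);
        });

        List<String> firsts = new ArrayList<>();
        String previousGroup = null;
        for (String key : sorted) {
            String group = groupKey(key);
            // Stands in for the grouping comparator: a new group starts when the group key changes.
            if (!group.equals(previousGroup)) {
                firsts.add(key); // the reducer only emits the first record of the group
                previousGroup = group;
            }
        }
        return firsts;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                "Atlanta:GA|Atlanta:GA:2:Someone",
                "Boston:MA|Boston:MA:1:Someone",
                "Atlanta:GA|Atlanta:GA:1:Adeel");
        for (String first : firstPerGroup(keys)) {
            System.out.println(first);
        }
    }
}
```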

On Fri, Aug 30, 2013 at 9:24 AM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:

yup it was negative and by doing this now it seems to be working fine

On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma <shekhar2581@gmail.com> wrote:

Is the hash code of that key negative?

Do something like this

return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;
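
For anyone hitting the same "Illegal partition ... (-1)" error: Java's % keeps the sign of the left operand, so a negative hashCode() gives a negative partition number. Clearing the sign bit first avoids that, but the mask needs its own parentheses, because % binds tighter than & in Java. The small sketch below shows the three variants side by side; the class and method names are made up for illustration, and -7 is just a stand-in for a negative hash.

```java
public class HashPartitionCheck {
    // Plain modulo: negative when the hash is negative.
    static int naive(int hash, int numParts) {
        return hash % numParts;
    }

    // Sign bit cleared before the modulo: always in [0, numParts).
    static int masked(int hash, int numParts) {
        return (hash & Integer.MAX_VALUE) % numParts;
    }

    // Without parentheses this parses as hash & (Integer.MAX_VALUE % numParts).
    static int unparenthesized(int hash, int numParts) {
        return hash & Integer.MAX_VALUE % numParts;
    }

    public static void main(String[] args) {
        int hash = -7; // stand-in for a negative String.hashCode()
        System.out.println("naive(-7, 4)  = " + naive(hash, 4));
        System.out.println("masked(-7, 4) = " + masked(hash, 4));
        // With 255 reducers the unparenthesized form masks with Integer.MAX_VALUE % 255,
        // so it can never return a partition above that value.
        System.out.println("MAX_VALUE % 255 = " + (Integer.MAX_VALUE % 255));
        System.out.println("unparenthesized(-7, 255) = " + unparenthesized(hash, 255));
    }
}
```

The unparenthesized form never goes negative, so it hides the error, but for a reducer count like 255 it leaves the higher-numbered reducers without any data.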


Som Shekhar Sharma


On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:

> okay so when i specify the number of reducers e.g. in my example i m using 4
> (for a much smaller data set) it works if I use a single column in my
> composite key .. but if I add multiple columns in the composite key
> separated by a delimiter .. it then throws the illegal partition error (keys
> before the pipe are group keys and after the pipe are the sort keys and my
> partitioner only uses the group keys)
>
> java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel (-1)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
>         at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
>         at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
>         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> public int getPartition(Text key, HCatRecord record, int numParts) {
>     //extract the group key from composite key
>     String groupKey = key.toString().split("\\|")[0];
>     return groupKey.hashCode() % numParts;
> }



> On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <shekhar2581@gmail.com> wrote:
>
>> No...partitioner decides which keys should go to which reducer...and
>> number of reducers you need to decide...No of reducers depends on
>> factors like number of key value pair, use case etc
>> Regards,
>> Som Shekhar Sharma
>> +91-8197243810
>>
>> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:
>> > so it cant figure out an appropriate number of reducers as it does for
>> > mappers .. in my case hadoop is using 2100+ mappers and then only 1 reducer
>> > .. since im overriding the partitioner class shouldnt that decide how
>> > many reducers there should be based on how many different partition values
>> > being returned by the custom partitioner
>> >
>> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ian@cloudera.com> wrote:
>> >>
>> >> If you don't specify the number of Reducers, Hadoop will use the default
>> >> -- which, unless you've changed it, is 1.
>> >>
>> >> Regards
>> >>
>> >> Ian.
>> >>
>> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:
>> >>
>> >> I have implemented secondary sort in my MR job and for some reason if i
>> >> dont specify the number of reducers it uses 1 which doesnt seem right
>> >> because im working with 800M+ records and one reducer slows things down
>> >> significantly. Is this some kind of limitation with the secondary sort that
>> >> it has to use a single reducer .. that kind of would defeat the purpose of
>> >> having a scalable solution such as secondary sort. I would appreciate any
>> >> help.
>> >>
>> >> Thanks
>> >> Adeel
>> >>
>> >> ---
>> >> Ian Wrigley
>> >> Sr. Curriculum Manager
>> >> Cloudera, Inc
>> >> Cell: (323) 819 4075
