Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of shekhar2581@gmail.com
 designates 209.85.212.41 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAH1YrM6uYC=rykHAd2k3s3AJH+hDCLikGwWHMd8D0vGecaLaGQ@mail.gmail.com>
References: 
 <CAH1YrM43=6DKE+EERJbJ1wRKyUX9zXWJVKQ2sMd8zN0sTas7_A@mail.gmail.com>
	<6CC784D5-D3E0-45BC-916C-D9865AA4F27B@cloudera.com>
	<CAH1YrM5Z-OD-x8EwLzSg+6WUi6C7w8HtVsXzPK9vZKu4aVKT5A@mail.gmail.com>
	<CAJxyRCgeJN7fnxWEcfonYg_KO0i1F4yBbhwiNCFoov4poD0sOw@mail.gmail.com>
	<CAH1YrM6uYC=rykHAd2k3s3AJH+hDCLikGwWHMd8D0vGecaLaGQ@mail.gmail.com>
Date: Fri, 30 Aug 2013 12:39:46 +0530
Message-ID: 
 <CAJxyRCgAMfJsfu0YvcsSVVoa9k9VVL85D4tOQt5ZJZACht3GVg@mail.gmail.com>
Subject: Re: secondary sort - number of reducers
From: Shekhar Sharma <shekhar2581@gmail.com>
To: user@hadoop.apache.org
Content-Type: text/plain; charset=ISO-8859-1

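Is the hash code of that key negative? If groupKey.hashCode() is negative,
the % result is negative too, which would explain the illegal partition (-1).
Clear the sign bit before taking the remainder, something like this:

return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;

The parentheses are needed because % binds tighter than & in Java.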
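
Here is a rough standalone sketch of the idea, in case it helps to try it
outside the job. It copies the body of your getPartition into a plain static
method so it runs without Hadoop on the classpath. The class name, the
partitionFor helper and the extra sample keys are made up for illustration;
only the first key comes from your stack trace. As far as I know, the mask is
the same idiom Hadoop's own HashPartitioner uses.

```java
public class GroupKeyPartitionDemo {

    // Hypothetical helper: same logic as the getPartition body in the thread,
    // but taking a plain String instead of Text/HCatRecord.
    static int partitionFor(String compositeKey, int numParts) {
        // The group key is everything before the first pipe.
        String groupKey = compositeKey.split("\\|")[0];
        // Clear the sign bit first, then take the remainder.
        // The parentheses matter: % binds tighter than & in Java.
        return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    public static void main(String[] args) {
        int numParts = 4;
        String[] keys = {
            "Atlanta:GA|Atlanta:GA:1:Adeel",   // key from the stack trace
            "Atlanta:GA|Atlanta:GA:2:Someone", // made-up key, same group
            "Dallas:TX|Dallas:TX:1:Someone"    // made-up key, other group
        };
        for (String key : keys) {
            // Unmasked version from the original partitioner, for comparison.
            int unmasked = key.split("\\|")[0].hashCode() % numParts;
            int masked = partitionFor(key, numParts);
            System.out.println(key + " unmasked=" + unmasked + " masked=" + masked);
        }
    }
}
```

The masked value always falls in the range 0 to numParts - 1, and keys that
share a group key still land in the same partition, which is what the
secondary sort needs.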

Regards,
Som Shekhar Sharma
+91-8197243810


On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:
> Okay, so when I specify the number of reducers (in my example I'm using 4,
> for a much smaller data set), it works if I use a single column in my
> composite key. But if I add multiple columns to the composite key,
> separated by a delimiter, it throws the illegal partition error. (Keys
> before the pipe are group keys, keys after the pipe are sort keys, and my
> partitioner only uses the group keys.)
>
> java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel (-1)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
>         at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
>         at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
>         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
>
> public int getPartition(Text key, HCatRecord record, int numParts) {
> //extract the group key from composite key
> String groupKey = key.toString().split("\\|")[0];
> return groupKey.hashCode() % numParts;
> }
>
>
> On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <shekhar2581@gmail.com>
> wrote:
>>
>> No. The partitioner decides which keys go to which reducer, and
>> you need to decide the number of reducers yourself. The number of reducers
>> depends on factors like the number of key-value pairs, the use case, etc.
>> Regards,
>> Som Shekhar Sharma
>> +91-8197243810
>>
>>
>> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <adeelmahmood@gmail.com>
>> wrote:
>> > So it can't figure out an appropriate number of reducers as it does for
>> > mappers? In my case Hadoop is using 2100+ mappers and then only 1
>> > reducer. Since I'm overriding the partitioner class, shouldn't that decide
>> > how many reducers there should be, based on how many different partition
>> > values are returned by the custom partitioner?
>> >
>> >
>> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ian@cloudera.com> wrote:
>> >>
>> >> If you don't specify the number of Reducers, Hadoop will use the
>> >> default
>> >> -- which, unless you've changed it, is 1.
>> >>
>> >> Regards
>> >>
>> >> Ian.
>> >>
>> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <adeelmahmood@gmail.com>
>> >> wrote:
>> >>
>> >> I have implemented secondary sort in my MR job, and for some reason if I
>> >> don't specify the number of reducers it uses 1, which doesn't seem right
>> >> because I'm working with 800M+ records and one reducer slows things down
>> >> significantly. Is this some kind of limitation of secondary sort, that
>> >> it has to use a single reducer? That would kind of defeat the purpose of
>> >> having a scalable solution such as secondary sort. I would appreciate any
>> >> help.
>> >>
>> >> Thanks
>> >> Adeel
>> >>
>> >>
>> >>
>> >> ---
>> >> Ian Wrigley
>> >> Sr. Curriculum Manager
>> >> Cloudera, Inc
>> >> Cell: (323) 819 4075
>> >>
>> >
>
>