Subject: Re: secondary sort - number of reducers
From: Adeel Qureshi <adeelmahmood@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 29 Aug 2013 20:55:46 -0400

Okay, so when I specify the number of reducers (e.g. 4 in my example, for a much smaller data set) it works if I use a single column in my composite key. But if I add multiple columns to the composite key, separated by a delimiter, it throws the illegal partition error below. (The keys before the pipe are the group keys and the ones after the pipe are the sort keys; my partitioner only uses the group keys.)

java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel (-1)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
        at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

public int getPartition(Text key, HCatRecord record, int numParts) {
    // extract the group key from the composite key
    String groupKey = key.toString().split("\\|")[0];
    return groupKey.hashCode() % numParts;
}

On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <shekhar2581@gmail.com> wrote:
> No... the partitioner decides which keys should go to which reducer, and
> you need to decide the number of reducers yourself. The number of reducers
> depends on factors like the number of key-value pairs, the use case, etc.
> Regards,
> Som Shekhar Sharma
> +91-8197243810
>
>
> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:
> > So it can't figure out an appropriate number of reducers the way it does
> > for mappers? In my case Hadoop is using 2100+ mappers and then only 1
> > reducer. Since I'm overriding the partitioner class, shouldn't that decide
> > how many reducers there should be, based on how many different partition
> > values are returned by the custom partitioner?
> >
> >
> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ian@cloudera.com> wrote:
> >>
> >> If you don't specify the number of Reducers, Hadoop will use the default
> >> -- which, unless you've changed it, is 1.
> >>
> >> Regards
> >>
> >> Ian.
> >>
> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <adeelmahmood@gmail.com> wrote:
> >>
> >> I have implemented secondary sort in my MR job, and for some reason if I
> >> don't specify the number of reducers it uses 1, which doesn't seem right,
> >> because I'm working with 800M+ records and one reducer slows things down
> >> significantly. Is this some kind of limitation with secondary sort, that
> >> it has to use a single reducer? That would kind of defeat the purpose of
> >> having a scalable solution such as secondary sort. I would appreciate any
> >> help.
> >>
> >> Thanks
> >> Adeel
> >>
> >>
> >>
> >> ---
> >> Ian Wrigley
> >> Sr. Curriculum Manager
> >> Cloudera, Inc
> >> Cell: (323) 819 4075
> >>
> >
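A note on the exception in this thread: Java's `String.hashCode()` can be negative, so `groupKey.hashCode() % numParts` can return a negative partition index, and a negative return from `getPartition()` is exactly what triggers Hadoop's "Illegal partition for ... (-1)" error. A likely fix (not spelled out in the thread; the class and method layout below are illustrative, and the logic is shown in plain Java so it runs standalone) is to mask off the sign bit before taking the modulus:

```java
public class GroupKeyPartition {
    // Maps the group-key portion of a composite key ("group|sort...")
    // to a partition index. Masking with Integer.MAX_VALUE clears the
    // sign bit, so the result is always in the range [0, numParts).
    static int getPartition(String compositeKey, int numParts) {
        String groupKey = compositeKey.split("\\|")[0];
        return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("Atlanta:GA|Atlanta:GA:1:Adeel", 4));
        // "polygenelubricants".hashCode() is Integer.MIN_VALUE, the worst
        // case for a bare % numParts; the masked version still yields a
        // valid partition.
        System.out.println(getPartition("polygenelubricants|x", 4));
    }
}
```

In the actual job this logic would sit inside the custom partitioner's `getPartition(Text key, HCatRecord record, int numParts)`, replacing the bare `groupKey.hashCode() % numParts`.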
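Ian's and Shekhar's point, that the partitioner never determines the reducer count and that the count must be set explicitly (the default is 1), corresponds to a driver sketch along these lines. This is an assumption-laden illustration against the `org.apache.hadoop.mapreduce` API of that era, not code from the thread: the comparator and partitioner class names are hypothetical placeholders, and input/output setup is omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Driver sketch for a secondary-sort job. The reducer count is not
// derived from the partitioner; it must be set explicitly here.
public class HiveSortDriver {
    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "hivesort");
        job.setNumReduceTasks(4);                                 // explicit reducer count
        job.setPartitionerClass(GroupKeyPartitioner.class);       // route by group key only
        job.setSortComparatorClass(CompositeKeyComparator.class); // sort by full composite key
        job.setGroupingComparatorClass(GroupKeyComparator.class); // group reducer input by group key
        return job;
    }
}
```

With this arrangement the group key decides which reducer a record reaches, the sort comparator orders records within each group, and `setNumReduceTasks` (or `mapred.reduce.tasks`) controls the parallelism that a single default reducer would otherwise throttle.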