From: Felix Chern <idryman@gmail.com>
Subject: Re: All datanodes are bad IOException when trying to implement multithreading serialization
Date: Sun, 29 Sep 2013 18:58:23 -0700
To: user@hadoop.apache.org

The number of mappers is usually the same as the number of files you feed to the job.
To reduce the number, you can use CombineFileInputFormat.
I recently wrote an article about it; take a look and see if it fits your needs:
http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
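In rough strokes, the driver-side setup looks something like the sketch below. This assumes Hadoop 2.x, where CombineTextInputFormat ships in mapreduce.lib.input; on older releases you subclass CombineFileInputFormat yourself, which is what the article walks through. The driver class name here is made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineSmallFilesDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files");
            job.setJarByClass(CombineSmallFilesDriver.class);

            // Pack many small files into a few splits; one mapper runs per split.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap each combined split at 256 MB (an example value; tune to your cluster).
            CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }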


Felix

On Sep 29, 2013, at 6:45 PM, yunming zhang <zhangyunming1990@gmail.com> wrote:

> I am actually trying to reduce the number of mappers, because my application uses a lot of memory (on the order of 1-2 GB of RAM per mapper). I want to use only a few mappers but still keep CPU utilization high through multithreading within a single mapper. MultithreadedMapper doesn't work for me because it duplicates the in-memory data structures.
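(Side note: MultithreadedMapper creates one Mapper instance per thread, which is why per-instance state gets duplicated. If the big structure is read-only once loaded, a common workaround is to park it in a static holder so every thread in the task JVM shares a single copy. This is only a sketch; SharedStateMapper and its lookup table are illustrative names, not anything from the attached code.)

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative mapper for MultithreadedMapper: the expensive lookup
    // table lives in a static field, so the N mapper instances created
    // inside one task JVM share one copy instead of holding N copies.
    public class SharedStateMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {

        private static volatile Map<String, Long> lookup; // read-only after init

        private static Map<String, Long> getLookup() {
            if (lookup == null) {
                synchronized (SharedStateMapper.class) {
                    if (lookup == null) {
                        Map<String, Long> m = new HashMap<String, Long>();
                        // ... load the large structure here, once per JVM ...
                        lookup = m;
                    }
                }
            }
            return lookup;
        }

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            Long hit = getLookup().get(value.toString());
            if (hit != null) {
                ctx.write(value, new LongWritable(hit));
            }
        }
    }

It would be wired up in the driver with job.setMapperClass(MultithreadedMapper.class), MultithreadedMapper.setMapperClass(job, SharedStateMapper.class), and MultithreadedMapper.setNumberOfThreads(job, 4).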

> Thanks

> Yunming


> On Sun, Sep 29, 2013 at 6:59 PM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:
> Wouldn't you rather just change your split size so that you can have more mappers work on your input? What else are you doing in the mappers?
> Sent from my iPad
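(For reference, the knob Sonal means is plain configuration. A minimal sketch using the Hadoop 2.x helper; on the 1.x branches the underlying key was mapred.max.split.size, and the 64 MB figure is just an example value.)

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitTuning {
        // A smaller maximum split size yields more splits, and Hadoop
        // schedules one map task per split.
        static void useSmallerSplits(Job job) {
            FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // 64 MB
        }
    }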

> On Sep 30, 2013, at 2:22 AM, yunming zhang <zhangyunming1990@gmail.com> wrote:

>> Hi,

>> I was playing with the Hadoop code, trying to have a single Mapper read an input split using multiple threads. I am getting an "All datanodes are bad" IOException, and I am not sure what the issue is.

>> The reason for this work is that I suspect my computation is slow because it takes too long to create the Text() objects from the input split using a single thread. I modified LineRecordReader (since I mostly use TextInputFormat) to provide additional methods for retrieving lines from the input split: getCurrentKey2(), getCurrentValue2(), and nextKeyValue2(). I created a second FSDataInputStream and a second LineReader object for getCurrentKey2() and getCurrentValue2() to read from. Essentially I am trying to open the input split twice at different start points (one at the very beginning, the other in the middle of the split) and read from it in parallel with two threads.
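(A sketch of the two-stream setup being described, with each thread owning its own stream so that no seek/read position is shared between threads; the class and variable names are illustrative, not the poster's code.)

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.util.LineReader;

    public class TwoReaders {
        // Open the same file twice so each reader thread gets an
        // independent stream position.
        static LineReader[] open(Path file, long midpoint, Configuration conf)
                throws IOException {
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in1 = fs.open(file); // thread 1: from the start
            FSDataInputStream in2 = fs.open(file); // thread 2: from the middle
            in2.seek(midpoint);
            return new LineReader[] {
                new LineReader(in1, conf), new LineReader(in2, conf)
            };
        }
    }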

>> In the org.apache.hadoop.mapreduce.Mapper.run() method, I modified it to read simultaneously using getCurrentKey() and getCurrentKey2() from Thread 1 and Thread 2 (both threads running at the same time):
>>     Thread 1:
>>       while (context.nextKeyValue()) {
>>           map(context.getCurrentKey(), context.getCurrentValue(), context);
>>       }
>>
>>     Thread 2:
>>       while (context.nextKeyValue2()) {
>>           map(context.getCurrentKey2(), context.getCurrentValue2(), context);
>>           // System.out.println("two iter");
>>       }

>> However, this causes me to see the "All datanodes are bad" exception. I am fairly sure I closed the second file. I have attached a copy of my LineRecordReader file to show what I changed in trying to enable two simultaneous reads of the input split.
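(One likely trouble spot, offered as a guess rather than a diagnosis: the shared Context is not thread-safe. MultithreadedMapper deals with this by deep-copying each key/value under a lock before mapping and by serializing writes through a synchronized wrapper. A hand-rolled two-thread run() needs the same guards, roughly as below; this is a sketch of the general technique, not the attached code.)

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.util.ReflectionUtils;

    // Sketch: serialize every touch of the shared context, the way
    // MultithreadedMapper's internal wrappers do, and deep-copy the
    // key/value because Hadoop reuses the same objects on each call.
    public class GuardedRunLoop<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
            extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

        @Override
        public void run(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            while (true) {
                KEYIN key;
                VALUEIN value;
                synchronized (context) { // guard the shared read position
                    if (!context.nextKeyValue()) {
                        break;
                    }
                    key = ReflectionUtils.copy(conf, context.getCurrentKey(), null);
                    value = ReflectionUtils.copy(conf, context.getCurrentValue(), null);
                }
                // map() runs outside the lock, but any context.write() it
                // performs must also synchronize on the shared context.
                map(key, value, context);
            }
        }
    }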

>> I have modified other files (org.apache.hadoop.mapreduce.RecordReader.java, mapred.MapTask.java, ...) just to enable Mapper.run() to call LineRecordReader.getCurrentKey2() and the other access methods for the second file.


>> I would really appreciate it if anyone could give me a bit of advice, or just point me in a direction as to where the problem might be.

>> Thanks

>> Yunming

>> <LineRecordReader.java>

