From: rab ra <rabmdu@gmail.com>
To: user@hadoop.apache.org
Date: Tue, 2 Sep 2014 10:48:41 +0530
Subject: Re: Hadoop InputFormat - Processing large number of small files

Hi,

I tried to use your CombineFileInputFormat implementation. However, I get the following exception:

'not org.apache.hadoop.mapred.InputFormat'

I am using Hadoop 2.4.1, and it looks like it expects the older interface, as it does not accept 'org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat'. May I know which version of Hadoop you used?

Looks like I need to use the older 'org.apache.hadoop.mapred.lib.CombineFileInputFormat'?

Thanks and Regards
rab
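For context, that exception usually means the job is being driven through the old JobConf-based API, which only accepts org.apache.hadoop.mapred.InputFormat implementations. A minimal driver sketch using the new (mapreduce) API instead; CombinedTextInputFormat is a placeholder for whatever CombineFileInputFormat subclass Felix's post defines, and TestMapper is his mapper quoted below:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "combine-small-files");
    job.setJarByClass(CombineSmallFilesDriver.class);
    // A mapreduce.lib.input.CombineFileInputFormat subclass only works with
    // the new Job API; passing it to a JobConf-based driver raises
    // "not org.apache.hadoop.mapred.InputFormat".
    job.setInputFormatClass(CombinedTextInputFormat.class);  // placeholder name
    job.setMapperClass(TestMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}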
On 20 Aug 2014 22:59, "Felix Chern" <idryman@gmail.com> wrote:

> I wrote a post on how to use CombineFileInputFormat:
>
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
>
> In the RecordReader constructor, you can get the context of which file you
> are reading in. In my example, I created FileLineWritable to include the
> filename in the mapper input key. Then you can use the input key as:
>
> public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable> {
>   private Text txt = new Text();
>   private IntWritable count = new IntWritable(1);
>
>   public void map(FileLineWritable key, Text val, Context context)
>       throws IOException, InterruptedException {
>     StringTokenizer st = new StringTokenizer(val.toString());
>     while (st.hasMoreTokens()) {
>       txt.set(key.fileName + st.nextToken());
>       context.write(txt, count);
>     }
>   }
> }
>
> Cheers,
> Felix
>
> On Aug 20, 2014, at 8:19 AM, rab ra <rabmdu@gmail.com> wrote:
>
> Thanks for the response.
>
> Yes, I know WholeFileInputFormat, but I am not sure the filename comes to
> the map process as either key or value. As I understand it, that format
> reads the contents of the file; I wish to have an input format that just
> gives the filename or a list of filenames.
>
> Also, the files are very small. WholeFileInputFormat spawns one map process
> per file and thus results in a huge number of map processes. I wish to
> spawn a single map process per group of files.
>
> I think I need to tweak CombineFileInputFormat's RecordReader so that it
> does not read the entire file but just the filename.
>
> regards
> rab
>
> regards
> Bala
>
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <shahab.yunus@gmail.com> wrote:
>
>> Have you looked at the WholeFileInputFormat implementations? There are
>> quite a few if you search for them:
>>
>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>>
>> Regards,
>> Shahab
>>
>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <rabmdu@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a use case wherein I need to process a huge set of files stored
>>> in HDFS. The files are non-splittable and need to be processed as a
>>> whole. I have the following questions, whose answers I need in order to
>>> proceed:
>>>
>>> 1. I wish to schedule each map process on a task tracker where its data
>>> is already available. How can I do that? Currently, I have a file that
>>> contains a list of filenames, and each map gets one line of it via
>>> NLineInputFormat. The map process then accesses its file via
>>> FSDataInputStream and works with it. Is there a way to ensure this map
>>> process runs on the node where the file is available?
>>>
>>> 2. The files are not large and would be called 'small' files by Hadoop
>>> standards. I came across CombineFileInputFormat, which can process more
>>> than one file in a single map process. What I need here is a format that
>>> can process more than one file in a single map but does not have to read
>>> the files, and instead carries the filenames in either the key or the
>>> value. In the map process, I can then run a loop to process these files.
>>> Any help?
>>>
>>> 3. Any other alternatives?
>>>
>>> regards
>>> rab
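A minimal sketch of the tweak rab describes above: a CombineFileInputFormat whose per-file RecordReader emits only the file's path and never opens the file. This assumes the new (mapreduce) API of Hadoop 2.x; the class names FileNameInputFormat and FileNameRecordReader are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class FileNameInputFormat extends CombineFileInputFormat<Text, NullWritable> {

  @Override
  public RecordReader<Text, NullWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    // CombineFileRecordReader instantiates one FileNameRecordReader per
    // file packed into the combined split.
    return new CombineFileRecordReader<Text, NullWritable>(
        (CombineFileSplit) split, context, FileNameRecordReader.class);
  }

  // Emits exactly one record per file: the file's path as the key.
  public static class FileNameRecordReader extends RecordReader<Text, NullWritable> {
    private final Path path;
    private boolean done = false;

    // This (CombineFileSplit, TaskAttemptContext, Integer) constructor is the
    // signature CombineFileRecordReader looks up by reflection; 'index'
    // selects which file of the combined split this reader covers.
    public FileNameRecordReader(CombineFileSplit split, TaskAttemptContext context,
                                Integer index) {
      this.path = split.getPath(index);
    }

    @Override public void initialize(InputSplit split, TaskAttemptContext context) { }

    @Override
    public boolean nextKeyValue() {
      if (done) return false;
      done = true;  // one record per file; the file itself is never opened
      return true;
    }

    @Override public Text getCurrentKey() { return new Text(path.toString()); }
    @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
    @Override public float getProgress() { return done ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}

How many small files land in one map is governed by the maximum split size (the protected setMaxSplitSize() on CombineFileInputFormat, or the mapreduce.input.fileinputformat.split.maxsize property). CombineFileInputFormat also packs its combined splits node- and rack-locally based on the block locations, which goes some way toward the locality concern in question 1.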