Subject: Re: Hadoop InputFormat - Processing large number of small files
From: rab ra <rabmdu@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 21 Aug 2014 14:08:32 +0530
In-Reply-To: <83870647-9F12-47AB-9790-7FD3B1806EDF@gmail.com>

Thanks for the link. If CombineFileInputFormat is not required to pass the
contents of the files to the map process, only the filename, what changes
need to be made in the code?

rab.
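
One way to make that change (a minimal, untested sketch, not Felix's
code; the class names here are made up): keep the CombineFileInputFormat
grouping, but swap in a RecordReader that emits exactly one record per
file -- the file's path as the key -- and never opens the file. This is
also what question 2 in the original message below asks for.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

    public class FileNameInputFormat
        extends CombineFileInputFormat<Text, NullWritable> {

      @Override
      public RecordReader<Text, NullWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context) throws IOException {
        return new CombineFileRecordReader<Text, NullWritable>(
            (CombineFileSplit) split, context, FileNameRecordReader.class);
      }

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false; // each small file stays whole
      }

      // One (path, null) record per file in the combined split; the file
      // itself is never opened here.
      public static class FileNameRecordReader
          extends RecordReader<Text, NullWritable> {
        private final Path path;
        private final Text key = new Text();
        private boolean done = false;

        // CombineFileRecordReader instantiates this with the split, the
        // task context, and the index of the file within the split.
        public FileNameRecordReader(CombineFileSplit split,
            TaskAttemptContext context, Integer index) {
          this.path = split.getPath(index);
        }

        @Override public void initialize(InputSplit s, TaskAttemptContext c) { }
        @Override public boolean nextKeyValue() {
          if (done) return false;
          key.set(path.toString());
          done = true;
          return true;
        }
        @Override public Text getCurrentKey() { return key; }
        @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
        @Override public float getProgress() { return done ? 1.0f : 0.0f; }
        @Override public void close() { }
      }
    }

The map(Text key, NullWritable value, ...) method then receives a batch
of paths, one nextKeyValue() per file in the combined split, and can open
only what it needs via FSDataInputStream, as in the NLineInputFormat
setup described below.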

On 20 Aug 2014 22:59, "Felix Chern" <idryman@gmail.com> wrote:
> I wrote a post on how to use CombineFileInputFormat:
>
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
> In the RecordReader constructor, you can get the context of which file
> you are reading in. In my example, I created FileLineWritable to include
> the filename in the mapper input key.
> Then you can use the input key as:
>
>     public static class TestMapper
>         extends Mapper<FileLineWritable, Text, Text, IntWritable> {
>       private Text txt = new Text();
>       private IntWritable count = new IntWritable(1);
>
>       public void map(FileLineWritable key, Text val, Context context)
>           throws IOException, InterruptedException {
>         StringTokenizer st = new StringTokenizer(val.toString());
>         while (st.hasMoreTokens()) {
>           txt.set(key.fileName + st.nextToken());
>           context.write(txt, count);
>         }
>       }
>     }
>
> Cheers,
> Felix
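
The post's FileLineWritable is not inlined here, but from the description
it is roughly a WritableComparable carrying the source file name (plus a
line offset) so it can serve as a mapper key. A rough sketch of that
shape (assumed, not Felix's exact code):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    public class FileLineWritable
        implements WritableComparable<FileLineWritable> {
      public String fileName;  // read by TestMapper as key.fileName
      public long offset;      // line offset within the file

      public void write(DataOutput out) throws IOException {
        Text.writeString(out, fileName);
        out.writeLong(offset);
      }

      public void readFields(DataInput in) throws IOException {
        fileName = Text.readString(in);
        offset = in.readLong();
      }

      public int compareTo(FileLineWritable o) {
        int cmp = fileName.compareTo(o.fileName);
        return cmp != 0 ? cmp : Long.compare(offset, o.offset);
      }

      @Override public int hashCode() {
        return 31 * fileName.hashCode() + (int) (offset ^ (offset >>> 32));
      }

      @Override public boolean equals(Object obj) {
        if (!(obj instanceof FileLineWritable)) return false;
        FileLineWritable other = (FileLineWritable) obj;
        return fileName.equals(other.fileName) && offset == other.offset;
      }
    }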

> On Aug 20, 2014, at 8:19 AM, rab ra <rabmdu@gmail.com> wrote:

> Thanks for the response.
>
> Yes, I know WholeFileInputFormat, but I am not sure the filename comes
> to the map process as either key or value. I think this format reads the
> contents of the file; I want an InputFormat that just gives the
> filename, or a list of filenames.
>
> Also, the files are very small. WholeFileInputFormat spawns one map
> process per file and thus results in a huge number of map processes. I
> want to spawn a single map process per group of files.
>
> I think I need to tweak CombineFileInputFormat's RecordReader so that it
> does not read the entire file, just the filename.
>
> regards
> rab
>
> regards
> Bala


> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <shahab.yunus@gmail.com> wrote:
>> Have you looked at the WholeFileInputFormat implementations? There are
>> quite a few if you search for them:
>>
>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>>
>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>>
>> Regards,
>> Shahab
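
For context, those implementations share one shape (a condensed sketch
along the lines of the second link above, not a verbatim copy): mark the
files non-splittable and have the RecordReader load the whole file into a
single value. This is also why WholeFileInputFormat cannot avoid reading
the file contents:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class WholeFileRecordReader
        extends RecordReader<NullWritable, BytesWritable> {
      private FileSplit fileSplit;
      private Configuration conf;
      private final BytesWritable value = new BytesWritable();
      private boolean processed = false;

      @Override
      public void initialize(InputSplit split, TaskAttemptContext context) {
        this.fileSplit = (FileSplit) split;
        this.conf = context.getConfiguration();
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        if (processed) return false;
        // The whole file becomes one value -- exactly the read rab wants
        // to avoid.
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
          in = fs.open(file);
          IOUtils.readFully(in, contents, 0, contents.length);
          value.set(contents, 0, contents.length);
        } finally {
          IOUtils.closeStream(in);
        }
        processed = true;
        return true;
      }

      @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
      @Override public BytesWritable getCurrentValue() { return value; }
      @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
      @Override public void close() { }
    }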


>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <rabmdu@gmail.com> wrote:
>>> Hello,
>>>
>>> I have a use case wherein I need to process a huge set of files stored
>>> in HDFS. The files are non-splittable and need to be processed as a
>>> whole. I have the following questions, whose answers I need before I
>>> can proceed:

>>> 1. I wish to schedule each map process on a task tracker where its
>>> data is already available. How can I do that? Currently, I have a file
>>> that contains a list of filenames; each map gets one line of it via
>>> NLineInputFormat, then accesses its file via FSDataInputStream and
>>> works with it. Is there a way to ensure each map process runs on a
>>> node where its file is available? (See the driver sketch after this
>>> message.)

>>> 2. The files are not large; they would be called 'small' files by
>>> Hadoop standards. I came across CombineFileInputFormat, which can
>>> process more than one file in a single map process. What I need is a
>>> format that handles more than one file per map but does not have to
>>> read the files, and carries the filenames in either the key or the
>>> value. In the map process I can then run a loop to process these
>>> files. Any help?

>>> 3. Any other alternatives?


>>> regards
>>> rab
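
On question 1 above: CombineFileInputFormat also helps with locality.
Unlike the file-listing approach, where the scheduler only sees the
locality of the list file itself, CombineFileInputFormat builds each
combined split from blocks grouped by node and rack, so maps tend to be
scheduled where the small files actually live. A driver sketch (untested;
it assumes the hypothetical FileNameInputFormat from the sketch near the
top of this thread, and the input/output paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallFilesDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "process-small-files");
        job.setJarByClass(SmallFilesDriver.class);

        // Filename-only records from combined, locality-aware splits.
        job.setInputFormatClass(FileNameInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path("/data/small-files"));

        // Cap the bytes per combined split so each map gets a group of
        // files rather than one map per file.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        // job.setMapperClass(...) and output key/value types go here.
        FileOutputFormat.setOutputPath(job, new Path("/out/small-files"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }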



