From: rab ra <rabmdu@gmail.com>
To: user@hadoop.apache.org
Date: Tue, 2 Sep 2014 10:48:41 +0530
Subject: Re: Hadoop InputFormat - Processing large number of small files

Hi,

I tried to use your CombineFileInputFormat implementation. However, I get the following exception:

'not org.apache.hadoop.mapred.InputFormat'

I am using Hadoop 2.4.1, and it looks like it expects the older interface, as it does not accept 'org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat'. May I know which version of Hadoop you used?

Looks like I need to use the older 'org.apache.hadoop.mapred.lib.CombineFileInputFormat'?

Thanks and Regards
rab
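For context, that exception usually means the job is being driven through the old JobConf-based API, which only accepts org.apache.hadoop.mapred.InputFormat implementations. A minimal driver sketch using the new (mapreduce) API instead; CombinedTextInputFormat is a placeholder for whatever CombineFileInputFormat subclass Felix's post defines, and TestMapper is his mapper quoted below:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "combine-small-files");
    job.setJarByClass(CombineSmallFilesDriver.class);
    // A mapreduce.lib.input.CombineFileInputFormat subclass only works with
    // the new Job API; passing it to a JobConf-based driver raises
    // "not org.apache.hadoop.mapred.InputFormat".
    job.setInputFormatClass(CombinedTextInputFormat.class);  // placeholder name
    job.setMapperClass(TestMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}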
On 20 Aug 2014 22:59, "Felix Chern" <idryman@gmail.com> wrote:

> I wrote a post on how to use CombineFileInputFormat:
>
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
>
> In the RecordReader constructor, you can get the context of which file you
> are reading in. In my example, I created FileLineWritable to include the
> filename in the mapper input key. Then you can use the input key as:
>
> public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable> {
>   private Text txt = new Text();
>   private IntWritable count = new IntWritable(1);
>
>   public void map(FileLineWritable key, Text val, Context context)
>       throws IOException, InterruptedException {
>     StringTokenizer st = new StringTokenizer(val.toString());
>     while (st.hasMoreTokens()) {
>       txt.set(key.fileName + st.nextToken());
>       context.write(txt, count);
>     }
>   }
> }
>
> Cheers,
> Felix
>
> On Aug 20, 2014, at 8:19 AM, rab ra <rabmdu@gmail.com> wrote:
>
> Thanks for the response.
>
> Yes, I know WholeFileInputFormat, but I am not sure the filename comes to
> the map process as either key or value. As I understand it, that format
> reads the contents of the file; I wish to have an input format that just
> gives the filename or a list of filenames.
>
> Also, the files are very small. WholeFileInputFormat spawns one map process
> per file and thus results in a huge number of map processes. I wish to
> spawn a single map process per group of files.
>
> I think I need to tweak CombineFileInputFormat's RecordReader so that it
> does not read the entire file but just the filename.
>
> regards
> rab
>
> regards
> Bala
>
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <shahab.yunus@gmail.com> wrote:
>
>> Have you looked at the WholeFileInputFormat implementations? There are
>> quite a few if you search for them:
>>
>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>>
>> Regards,
>> Shahab
>>
>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <rabmdu@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a use case wherein I need to process a huge set of files stored
>>> in HDFS. The files are non-splittable and need to be processed as a
>>> whole. I have the following questions, whose answers I need in order to
>>> proceed:
>>>
>>> 1. I wish to schedule each map process on a task tracker where its data
>>> is already available. How can I do that? Currently, I have a file that
>>> contains a list of filenames, and each map gets one line of it via
>>> NLineInputFormat. The map process then accesses its file via
>>> FSDataInputStream and works with it. Is there a way to ensure this map
>>> process runs on the node where the file is available?
>>>
>>> 2. The files are not large and would be called 'small' files by Hadoop
>>> standards. I came across CombineFileInputFormat, which can process more
>>> than one file in a single map process. What I need here is a format that
>>> can process more than one file in a single map but does not have to read
>>> the files, and instead carries the filenames in either the key or the
>>> value. In the map process, I can then run a loop to process these files.
>>> Any help?
>>>
>>> 3. Any other alternatives?
>>>
>>> regards
>>> rab
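A minimal sketch of the tweak rab describes above: a CombineFileInputFormat whose per-file RecordReader emits only the file's path and never opens the file. This assumes the new (mapreduce) API of Hadoop 2.x; the class names FileNameInputFormat and FileNameRecordReader are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class FileNameInputFormat extends CombineFileInputFormat<Text, NullWritable> {

  @Override
  public RecordReader<Text, NullWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    // CombineFileRecordReader instantiates one FileNameRecordReader per
    // file packed into the combined split.
    return new CombineFileRecordReader<Text, NullWritable>(
        (CombineFileSplit) split, context, FileNameRecordReader.class);
  }

  // Emits exactly one record per file: the file's path as the key.
  public static class FileNameRecordReader extends RecordReader<Text, NullWritable> {
    private final Path path;
    private boolean done = false;

    // This (CombineFileSplit, TaskAttemptContext, Integer) constructor is the
    // signature CombineFileRecordReader looks up by reflection; 'index'
    // selects which file of the combined split this reader covers.
    public FileNameRecordReader(CombineFileSplit split, TaskAttemptContext context,
                                Integer index) {
      this.path = split.getPath(index);
    }

    @Override public void initialize(InputSplit split, TaskAttemptContext context) { }

    @Override
    public boolean nextKeyValue() {
      if (done) return false;
      done = true;  // one record per file; the file itself is never opened
      return true;
    }

    @Override public Text getCurrentKey() { return new Text(path.toString()); }
    @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
    @Override public float getProgress() { return done ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}

How many small files land in one map is governed by the maximum split size (the protected setMaxSplitSize() on CombineFileInputFormat, or the mapreduce.input.fileinputformat.split.maxsize property). CombineFileInputFormat also packs its combined splits node- and rack-locally based on the block locations, which goes some way toward the locality concern in question 1.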