From: Hemanth Yamijala <yhemanth@thoughtworks.com>
To: user@hadoop.apache.org
Date: Thu, 6 Dec 2012 20:13:52 +0530
Subject: Re: Map tasks processing some files multiple times

Glad it helps. Could you also explain the reason for using MultipleInputs?
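As context for that question: when every input path uses the same input format and the same mapper, MultipleInputs is not strictly needed. A rough sketch of the two wirings, assuming the new mapreduce API and reusing the class names that appear later in the thread; this is not the actual job code:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;

    public class InputWiringSketch {

        // Variant A: MultipleInputs, as used in the job below. Mainly useful when
        // different paths need different input formats or different mappers.
        static void withMultipleInputs(Job job, Path lsDir) {
            MultipleInputs.addInputPath(job, lsDir,
                FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
        }

        // Variant B: plain FileInputFormat wiring, enough when all paths share one
        // input format and one mapper class.
        static void withoutMultipleInputs(Job job, Path lsDir) throws IOException {
            FileInputFormat.addInputPath(job, lsDir);
            job.setInputFormatClass(FileNameTextInputFormat.class);
            job.setMapperClass(LinkShareCatalogImportMapper.class);
        }
    }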
On Thu, Dec 6, 2012 at 2:59 PM, David Parks <davidparks21@yahoo.com> wrote:

Figured it out, it is, as usual, with my code. I had wrapped TextInputFormat to replace the LongWritable key with a key representing the file name. It was a bit tricky to do because of changing the generics from <LongWritable, Text> to <Text, Text>, and I goofed up and misdirected a call to isSplitable(), which was causing the issue.

It now works fine. Thanks very much for the response; it gave me pause to think enough to work out what I had done.

Dave

From: Hemanth Yamijala [mailto:yhemanth@thoughtworks.com]
Sent: Thursday, December 06, 2012 3:25 PM
To: user@hadoop.apache.org
Subject: Re: Map tasks processing some files multiple times

David,

You are using FileNameTextInputFormat. This is not in the Hadoop source, as far as I can see. Can you please confirm where this class comes from? It seems like the isSplitable() method of this input format may need checking.

Another thing: given that you are adding the same input format for all files, do you need MultipleInputs?

Thanks
Hemanth

On Thu, Dec 6, 2012 at 1:06 PM, David Parks <davidparks21@yahoo.com> wrote:

I believe I just tracked down the problem; maybe you can help confirm if you're familiar with this.

I see that FileInputFormat is reporting gzip files (.gz extension) on the s3n filesystem as splittable, and I see that it's creating multiple input splits for these files. I'm mapping the files directly off S3:

    Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
    MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);

I see in the map phase, based on my counters, that it's actually processing the entire file (I set up a counter per input file). So the 2 files that were processed twice had 2 splits (I now see that in some debug logs I created), and the 1 file that was processed 3 times had 3 splits (the rest were smaller and were only assigned one split by default anyway).

Am I wrong in expecting all files on the s3n filesystem to come through as not splittable? This seems to be a bug in the Hadoop code if I'm right.

David
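The FileNameTextInputFormat referred to above is David's own wrapper rather than a stock Hadoop class, and its code is not shown in the thread. A minimal sketch of what such a wrapper could look like once the isSplitable() override is in place, so each (gzipped) file becomes exactly one split and the file name becomes the key; the details here are an assumption, not the actual class:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Hypothetical wrapper: emits <file name, line> pairs and never splits a file,
    // so a .gz file on s3n is read start-to-finish by a single map task.
    public class FileNameTextInputFormat extends FileInputFormat<Text, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // The crucial override: one split per file, regardless of size or codec.
            return false;
        }

        @Override
        public RecordReader<Text, Text> createRecordReader(InputSplit split,
                                                           TaskAttemptContext context) {
            return new RecordReader<Text, Text>() {
                private final LineRecordReader lines = new LineRecordReader();
                private Text fileName;

                @Override
                public void initialize(InputSplit split, TaskAttemptContext context)
                        throws IOException, InterruptedException {
                    lines.initialize(split, context);
                    fileName = new Text(((FileSplit) split).getPath().getName());
                }

                @Override
                public boolean nextKeyValue() throws IOException, InterruptedException {
                    return lines.nextKeyValue();
                }

                @Override
                public Text getCurrentKey() {
                    return fileName;  // same key (the file name) for every record in the file
                }

                @Override
                public Text getCurrentValue() throws IOException, InterruptedException {
                    return lines.getCurrentValue();
                }

                @Override
                public float getProgress() throws IOException, InterruptedException {
                    return lines.getProgress();
                }

                @Override
                public void close() throws IOException {
                    lines.close();
                }
            };
        }
    }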
From: Raj Vishwanathan [mailto:rajvish@yahoo.com]
Sent: Thursday, December 06, 2012 1:45 PM
To: user@hadoop.apache.org
Subject: Re: Map tasks processing some files multiple times

Could it be due to spec-ex (speculative execution)? Does it make a difference in the end?

Raj

------------------------------
From: David Parks <davidparks21@yahoo.com>
To: user@hadoop.apache.org
Sent: Wednesday, December 5, 2012 10:15 PM
Subject: Map tasks processing some files multiple times

I've got a job that reads in 167 files from S3, but 2 of the files are being mapped twice and 1 of the files is mapped 3 times.

This is the code I use to set up the mapper:

    Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
    for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
        log.info("Identified linkshare catalog: " + f.getPath().toString());
    if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
        MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
    }

I can see from the logs that it sees only 1 copy of each of these files and correctly identifies 167 files.

I also have the following confirmation that it found the 167 files correctly:

    2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167

When I look through the syslogs I can see that the file in question was opened by two different map attempts:

    ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading
    ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

This is only happening to these 3 files; all others seem to be fine. For the life of me I can't see a reason why these files might be processed multiple times.

Notably, map attempt number 173 is higher than should be possible: there are 167 input files (from S3, gzipped), so there should be 167 map attempts. But I see a total of 176 map tasks.

Any thoughts/ideas/guesses?
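For reference on the splittability question: the stock TextInputFormat in reasonably recent Hadoop versions decides per file based on the compression codec, so a plain .gz file should come through as a single split whichever filesystem it lives on; only a wrapper that bypasses this check would report it as splittable. A rough sketch of that logic, written as an override for illustration (the method body approximates what the stock class does, it is not copied from it):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Illustration of codec-based splittability: gzip has no splittable codec,
    // so each .gz input file should yield exactly one split (and one map task).
    public class CodecAwareTextInputFormat extends TextInputFormat {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            CompressionCodec codec =
                new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
            if (codec == null) {
                return true;                                      // uncompressed text: safe to split
            }
            return codec instanceof SplittableCompressionCodec;   // gzip: false, bzip2: true
        }
    }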

