From: "David Parks" <davidparks21@yahoo.com>
To: user@hadoop.apache.org
Subject: RE: Map tasks processing some files multiple times
Date: Thu, 6 Dec 2012 14:36:41 +0700
boundary="----=_NextPart_000_05BD_01CDD3BF.1D39BD70" X-Mailer: Microsoft Outlook 14.0 Thread-Index: AQHUZY7MhTuvkLjnws1+UXpB19ThSgGg68Zgl/E0uUA= Content-Language: en-us X-Virus-Checked: Checked by ClamAV on apache.org This is a multipart message in MIME format. ------=_NextPart_000_05BD_01CDD3BF.1D39BD70 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable I believe I just tracked down the problem, maybe you can help confirm if = you=E2=80=99re familiar with this. =20 I see that FileInputFormat is specifying that gzip files (.gz extension) = from s3n filesystem are being reported as splittable, and I see that = it=E2=80=99s creating multiple input splits for these files. I=E2=80=99m = mapping the files directly off S3: =20 Path lsDir =3D new = Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*"); MultipleInputs.addInputPath(job, lsDir, = FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class); =20 I see in the map phase, based on my counters, that it=E2=80=99s actually = processing the entire file (I set up a counter per file input). So the 2 = files which were processed twice had 2 splits (I now see that in some = debug logs I created), and the 1 file that was processed 3 times had 3 = splits (the rest were smaller and were only assigned one split by = default anyway). =20 Am I wrong in expecting all files on the s3n filesystem to come through = as not-splittable? This seems to be a bug in hadoop code if I=E2=80=99m = right. =20 David =20 =20 From: Raj Vishwanathan [mailto:rajvish@yahoo.com]=20 Sent: Thursday, December 06, 2012 1:45 PM To: user@hadoop.apache.org Subject: Re: Map tasks processing some files multiple times =20 Could it be due to spec-ex? Does it make a diffrerence in the end? =20 Raj =20 _____ =20 From: David Parks To: user@hadoop.apache.org=20 Sent: Wednesday, December 5, 2012 10:15 PM Subject: Map tasks processing some files multiple times =20 I=E2=80=99ve got a job that reads in 167 files from S3, but 2 of the = files are being mapped twice and 1 of the files is mapped 3 times. =20 This is the code I use to set up the mapper: =20 Path lsDir =3D new = Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*"); for(FileStatus f : = lsDir.getFileSystem(getConf()).globStatus(lsDir)) log.info("Identified = linkshare catalog: " + f.getPath().toString()); if( lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0 = ){ MultipleInputs.addInputPath(job, lsDir, = FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class); } =20 I can see from the logs that it sees only 1 copy of each of these files, = and correctly identifies 167 files. 
=20 I also have the following confirmation that it found the 167 files = correctly: =20 2012-12-06 04:56:41,213 INFO = org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total = input paths to process : 167 =20 When I look through the syslogs I can see that the file in question was = opened by two different map attempts: =20 ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_= 0/syslog:2012-12-06 03:56:05,265 INFO = org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening = 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Un= iverse~85.csv.gz' for reading ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_= 0/syslog:2012-12-06 03:53:18,765 INFO = org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening = 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Un= iverse~85.csv.gz' for reading =20 This is only happening to these 3 files, all others seem to be fine. For = the life of me I can=E2=80=99t see a reason why these files might be = processed multiple times. =20 Notably, map attempt 173 is more map attempts than should be possible. = There are 167 input files (from S3, gzipped), thus there should be 167 = map attempts. But I see a total of 176 map tasks. =20 Any thoughts/ideas/guesses? =20 =20 ------=_NextPart_000_05BD_01CDD3BF.1D39BD70 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

I believe I just tracked down the problem; maybe you can help confirm it if you're familiar with this.

 

I see that FileInputFormat is reporting gzip files (.gz extension) on the s3n filesystem as splittable, and I see that it's creating multiple input splits for these files. I'm mapping the files directly off S3:

 

    Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
    MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
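
(For reference, my understanding is that the stock TextInputFormat decides splittability by asking the compression codec factory about the file name, roughly like the sketch below; I'm quoting from memory, not the actual source. By that logic a .gz extension should come back non-splittable:)

    // Roughly what TextInputFormat.isSplitable does in the 1.x line (from
    // memory, not verbatim): a file is splittable only if no compression
    // codec claims it, and GzipCodec matches the .gz extension.
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec =
            new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        return codec == null;
    }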

 

I see in the map phase, based on my counters, that it's actually processing the entire file (I set up a counter per input file). So the 2 files that were processed twice had 2 splits (I now see that in some debug logs I created), and the 1 file that was processed 3 times had 3 splits (the rest were smaller and were only assigned one split by default anyway).
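
(The per-file counters are nothing fancy, by the way. Assuming, as the name suggests, that our FileNameTextInputFormat hands the source file name to the mapper as the key, it's roughly one line in map(); the key/value types here are illustrative:)

    // Illustrative sketch only: the Text key is assumed to carry the source
    // file name; one counter per file shows how many records each file fed in.
    @Override
    protected void map(Text fileName, Text line, Context context)
            throws IOException, InterruptedException {
        context.getCounter("InputFiles", fileName.toString()).increment(1);
        // ... normal record handling follows ...
    }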

 

Am I wrong in expecting all files on the s3n filesystem to come through as non-splittable? This seems to be a bug in the Hadoop code if I'm right.
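
(If I'm right, the workaround on my side seems straightforward: override isSplitable() to refuse splitting outright. An untested sketch, shown on a plain TextInputFormat subclass since FileNameTextInputFormat is our own class:)

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Untested workaround sketch: declare every input file non-splittable so
    // each file is consumed by exactly one map task, whatever the filesystem
    // reports.
    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }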

 

David

 

 

From: Raj Vishwanathan [mailto:rajvish@yahoo.com]
Sent: Thursday, December 06, 2012 1:45 PM
To: user@hadoop.apache.org
Subject: Re: Map tasks processing some files multiple times

 

Could it be due to spec-ex (speculative execution)? Does it make a difference in the end?
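
(One way to rule it out is to turn speculative execution off for the map tasks on a test run; if I remember the 1.x-era property name right, it's:)

    // Disable speculative execution of map tasks for this job (Hadoop 1.x-era
    // property name), so no map attempt is launched twice just for speed.
    job.getConfiguration().setBoolean("mapred.map.tasks.speculative.execution", false);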

Raj

From: David Parks <davidparks21@yahoo.com>
To: user@hadoop.apache.org
Sent: Wednesday, December 5, 2012 10:15 PM
Subject: Map tasks processing some files multiple times

 

I've got a job that reads in 167 files from S3, but 2 of the files are being mapped twice and 1 of the files is mapped 3 times.

 

This is the code I use to set up the mapper:

 

    Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
    for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
        log.info("Identified linkshare catalog: " + f.getPath().toString());
    if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
        MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
    }

 =

I can see from the logs that it sees only 1 copy of each of these files, and correctly identifies 167 files.

 

I also have the following confirmation that it found the 167 files correctly:

 

2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167

 

When I look through the syslogs I can see that the file in question was opened by two different map attempts:

 

./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

 

This is only happening to these 3 files; all the others seem to be fine. For the life of me I can't see a reason why these files might be processed multiple times.

 

Notably, map task number 173 (from the attempt ID above) is higher than should be possible: there are 167 input files (from S3, gzipped), so there should be 167 map tasks, but I see a total of 176 map tasks.

 

Any thoughts/ideas/guesses?

 

 
