From: "David Parks"
To: user@hadoop.apache.org
Subject: Map tasks processing some files multiple times
Date: Thu, 6 Dec 2012 13:15:03 +0700

I've got a job that reads in 167 files from S3, but 2 of the files are being mapped twice and 1 of the files is mapped 3 times.

This is the code I use to set up the mapper:

    Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");

    for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
        log.info("Identified linkshare catalog: " + f.getPath().toString());

    if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
        MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class,
                LinkShareCatalogImportMapper.class);
    }

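For completeness, here is the same setup as a self-contained driver sketch (Hadoop 1.x, new "mapreduce" API). FileNameTextInputFormat and LinkShareCatalogImportMapper are our own classes, so treat them as placeholders; the only real change from the snippet above is that the glob is expanded once and the result reused:

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class CatalogImportDriver extends Configured implements Tool {
        private static final Log log = LogFactory.getLog(CatalogImportDriver.class);

        public int run(String[] args) throws Exception {
            Job job = new Job(getConf(), "catalog-import");
            job.setJarByClass(CatalogImportDriver.class);

            Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
            FileSystem fs = lsDir.getFileSystem(getConf());

            // Expand the glob once and reuse the result, rather than
            // calling globStatus() a second time for the length check.
            FileStatus[] matches = fs.globStatus(lsDir);
            if (matches != null && matches.length > 0) {
                for (FileStatus f : matches)
                    log.info("Identified linkshare catalog: " + f.getPath().toString());
                MultipleInputs.addInputPath(job, lsDir,
                        FileNameTextInputFormat.class,      // our own input format
                        LinkShareCatalogImportMapper.class); // our own mapper
            }

            // ... output format, output path, and the other inputs omitted ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new CatalogImportDriver(), args));
        }
    }
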
I can see from the logs that it sees only 1 copy of each of these files, and correctly identifies 167 files.

I also have the following confirmation that it found the 167 files correctly:

    2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167

When I look through the syslogs I can see that the file in question was opened by two different map attempts:

    ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

    ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

This is only happening to these 3 files; all the others seem to be fine. For the life of me I can't see a reason why these files might be processed multiple times.

Notably, a task ID of m_000173 is higher than should be possible: the 167 input files are gzipped, so each one is unsplittable and should get exactly one map task, meaning task IDs should only run from m_000000 to m_000166. Yet I see a total of 176 map tasks.

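The only local explanation I could construct is overlap between input globs: a file that matched two of the globs registered via MultipleInputs would get two splits and be mapped by two different tasks, which is exactly the pattern above. A sanity check along these lines (run inside the driver's run() method; needs java.util.Map/HashMap; the second glob string is a made-up placeholder for the other inputs we register):

    // Count how many of the registered input globs each file matches.
    // A count > 1 would explain a file being mapped by two different tasks.
    // The non-linkshare glob below is a hypothetical placeholder.
    String[] globs = {
        "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*",
        "s3n://fruggmapreduce/input/catalogs/other_catalogs/*"
    };
    Map<String, Integer> matchCount = new HashMap<String, Integer>();
    for (String g : globs) {
        Path pattern = new Path(g);
        FileStatus[] matches = pattern.getFileSystem(getConf()).globStatus(pattern);
        if (matches == null) continue;
        for (FileStatus f : matches) {
            String key = f.getPath().toString();
            Integer n = matchCount.get(key);
            matchCount.put(key, n == null ? 1 : n + 1);
        }
    }
    for (Map.Entry<String, Integer> e : matchCount.entrySet())
        if (e.getValue() > 1)
            log.warn(e.getKey() + " matches " + e.getValue() + " input globs");

This only inspects the FileSystem; it doesn't change the job itself.
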
Any thoughts/ideas/guesses?