hadoop-common-user mailing list archives

From Raj Vishwanathan <rajv...@yahoo.com>
Subject Re: Map tasks processing some files multiple times
Date Thu, 06 Dec 2012 06:45:26 GMT
Could it be due to spec-ex (speculative execution)? Does it make a difference in the end?
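If spec-ex is the culprit, it can be ruled out by switching speculative execution off for the job. A minimal sketch, assuming a Tool-style driver using the new (org.apache.hadoop.mapreduce) API like the one quoted below:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = getConf();           // assumes a Tool-style driver
    Job job = new Job(conf);                  // Job.getInstance(conf) on 2.x
    job.setMapSpeculativeExecution(false);    // no speculative map attempts
    job.setReduceSpeculativeExecution(false); // no speculative reduce attempts
    // Equivalent pre-2.x configuration keys, if set directly:
    // conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    // conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

Note that even with spec-ex on, only one attempt's output gets committed, which is why duplicate attempts may make no difference to the final result.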


> From: David Parks <davidparks21@yahoo.com>
>To: user@hadoop.apache.org 
>Sent: Wednesday, December 5, 2012 10:15 PM
>Subject: Map tasks processing some files multiple times
>I’ve got a job that reads in 167 files from S3, but 2 of the files are being mapped twice and 1 of the files is mapped 3 times.
>This is the code I use to set up the mapper:
>       Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>       for(FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
>              log.info("Identified linkshare catalog: " + f.getPath().toString());
>       if( lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0 ){
>              MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, ...);
>       }
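(The fourth argument was cut off in the archive; for reference, the complete call presumably looks like the sketch below, where LinkShareCatalogMapper is a hypothetical stand-in for whatever mapper class the original message passed.)

    // Hypothetical reconstruction: LinkShareCatalogMapper stands in for the
    // real mapper class, which is truncated in the archived message.
    MultipleInputs.addInputPath(job, lsDir,
            FileNameTextInputFormat.class,
            LinkShareCatalogMapper.class);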
>I can see from the logs that it sees only 1 copy of each of these files, and correctly identifies 167 files.
>I also have the following confirmation that it found the 167 files correctly:
>2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167
>When I look through the syslogs I can see that the file in question was opened by two different map attempts:
>03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading
>03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading
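One way to pin down whether those two opens came from retries or speculative duplicates of the same task, as opposed to genuinely duplicated input splits, is to log the task attempt ID from the mapper. A minimal sketch (the class name AttemptLoggingMapper and the key/value types are illustrative, not from the original message):

    import java.io.IOException;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class AttemptLoggingMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Log LOG = LogFactory.getLog(AttemptLoggingMapper.class);

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The trailing _0/_1/_2 on an attempt ID is the attempt number:
            // anything above _0 is a retry or a speculative duplicate of the
            // same split, not an extra split.
            LOG.info("Task attempt " + context.getTaskAttemptID());

            // Under MultipleInputs the split is wrapped in a package-private
            // TaggedInputSplit, so guard the cast rather than assume FileSplit.
            if (context.getInputSplit() instanceof FileSplit) {
                LOG.info("Reading " + ((FileSplit) context.getInputSplit()).getPath());
            }
        }
    }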
>This is only happening to these 3 files; all the others seem fine. For the life of me I can’t see a reason why these files might be processed multiple times.
>Notably, a map attempt numbered 173 is higher than should be possible: there are 167 input files (from S3, gzipped, hence unsplittable), so there should be exactly 167 map tasks, but I see a total of 176.
>Any thoughts/ideas/guesses?