hadoop-user mailing list archives

From "David Parks" <davidpark...@yahoo.com>
Subject Map tasks processing some files multiple times
Date Thu, 06 Dec 2012 06:15:03 GMT
I've got a job that reads in 167 files from S3, but 2 of the files are being
mapped twice and 1 of the files is mapped 3 times.


This is the code I use to set up the mapper:


       Path lsDir = new

       for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
              log.info("Identified linkshare catalog: " + f.getPath().toString());

       if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
              MultipleInputs.addInputPath(job, lsDir,
                     FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
       }



I can see from the logs that it sees only 1 copy of each of these files, and
correctly identifies 167 files.


I also have the following confirmation that it found the 167 files:


2012-12-06 04:56:41,213 INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input
paths to process : 167


When I look through the syslogs I can see that the file in question was
opened by two different map attempts:


syslog:2012-12-06 03:56:05,265 INFO
org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
rse~85.csv.gz' for reading

syslog:2012-12-06 03:53:18,765 INFO
org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
rse~85.csv.gz' for reading
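
To make this kind of check repeatable across all task syslogs, the scan for repeated "Opening ... for reading" messages can be automated. The following is only a minimal sketch, assuming each log message sits on a single line in the actual syslog files and that the path is wrapped in single quotes as in the excerpts above. The class name DuplicateOpenFinder and its methods are hypothetical and not part of Hadoop.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Hadoop): finds input paths that were
// opened more than once according to the task syslogs.
public class DuplicateOpenFinder {
    // Assumed message shape, taken from the excerpts: Opening '<path>' for reading
    private static final Pattern OPENING =
            Pattern.compile("Opening '([^']+)' for reading");

    /** Counts how often each path appears in an "Opening ... for reading" message. */
    public static Map<String, Integer> countOpens(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = OPENING.matcher(line);
            while (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns only the paths that were opened more than once. */
    public static List<String> openedMoreThanOnce(List<String> logLines) {
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, Integer> e : countOpens(logLines).entrySet()) {
            if (e.getValue() > 1) {
                duplicates.add(e.getKey());
            }
        }
        return duplicates;
    }

    // Usage: java DuplicateOpenFinder <syslog file> [<syslog file> ...]
    public static void main(String[] args) throws Exception {
        List<String> lines = new ArrayList<>();
        for (String arg : args) {
            lines.addAll(Files.readAllLines(Paths.get(arg)));
        }
        for (String path : openedMoreThanOnce(lines)) {
            System.out.println(path);
        }
    }
}
```

Running it over the syslog files of every map attempt should print each path that more than one attempt opened. It does not say why the path was opened again, only that it was.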


This is only happening to these 3 files; all others seem to be fine. For the
life of me I can't see a reason why these files might be processed multiple
times.


Notably, a map attempt numbered 173 should not be possible. There are 167
input files (from S3, gzipped), so there should be 167 map attempts, yet I
see a total of 176 map tasks.
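
Since a single map task can have more than one attempt, it may help to separate the two counts before comparing them against the number of input files. Below is a rough sketch, assuming the usual Hadoop attempt ID layout attempt_<jobtracker id>_<job number>_<m|r>_<task number>_<attempt number>. The class AttemptGrouper is hypothetical, and the attempt IDs in the usage note are made up for illustration, not taken from this job.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Hadoop): groups map attempt IDs by task
// number so that tasks with more than one attempt stand out.
public class AttemptGrouper {
    // Assumed layout: attempt_<jobtracker id>_<job number>_<m|r>_<task number>_<attempt number>
    private static final Pattern ATTEMPT =
            Pattern.compile("attempt_(\\d+)_(\\d+)_([mr])_(\\d+)_(\\d+)");

    /** Maps each map task number to its number of distinct attempts. */
    public static Map<Integer, Integer> attemptsPerMapTask(Collection<String> attemptIds) {
        Map<Integer, Integer> counts = new TreeMap<>();
        // De-duplicate first: the same attempt ID usually appears on many log lines.
        for (String id : new HashSet<>(attemptIds)) {
            Matcher m = ATTEMPT.matcher(id);
            if (m.matches() && m.group(3).equals("m")) {
                counts.merge(Integer.parseInt(m.group(4)), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For example, given the made-up IDs attempt_201212060351_0001_m_000085_0 and attempt_201212060351_0001_m_000085_1, the result maps task 85 to 2 attempts. The size of the returned map is the number of distinct map tasks, which is the figure to compare with the number of input files.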


Any thoughts/ideas/guesses?

