David,

You are using FileNameTextInputFormat. This is not in the Hadoop source, as far as I can see. Can you please confirm where it comes from? It seems like the isSplitable method of this input format may need checking.
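
For illustration, if the custom format extends FileInputFormat directly and does not override isSplitable, it inherits the default (always splittable), and .gz inputs will be split. A rough sketch of the kind of check to look for follows; the class below is a hypothetical stand-in, not your actual code, and simply mirrors what stock TextInputFormat does with the compression codec:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Hypothetical stand-in for a custom text input format. The key point is the
    // isSplitable() override: FileInputFormat's default returns true, so a custom
    // format that does not override it will happily split .gz files.
    public class NonSplittingTextInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // A codec is registered for .gz (GzipCodec), and gzip streams cannot
            // be split, so return false for any compressed file.
            CompressionCodec codec =
                new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
            return codec == null;
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                                   TaskAttemptContext context) {
            // Plain line-oriented reader; the real class presumably does more.
            return new LineRecordReader();
        }
    }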

Another thing: given you are adding the same input format for all files, do you need MultipleInputs?
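
If not, a single-format setup along these lines would be equivalent (a minimal sketch reusing the class names from your code; "job" is the Job being configured):

    // Sketch: one input format and one mapper for every matched path, without MultipleInputs.
    Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
    FileInputFormat.addInputPath(job, lsDir);   // globs are expanded when splits are computed
    job.setInputFormatClass(FileNameTextInputFormat.class);
    job.setMapperClass(LinkShareCatalogImportMapper.class);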

Thanks
Hemanth


On Thu, Dec 6, 2012 at 1:06 PM, David Parks <davidparks21@yahoo.com> wrote:

I believe I just tracked down the problem; maybe you can help confirm, if you're familiar with this.

I see that FileInputFormat is reporting gzip files (.gz extension) from the s3n filesystem as splittable, and I see that it's creating multiple input splits for these files. I'm mapping the files directly off S3:

    Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");

    MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);

I see in the map phase, based on my counters, that it's actually processing the entire file (I set up a counter per input file). So the 2 files that were processed twice had 2 splits (I now see that in some debug logs I created), and the 1 file that was processed 3 times had 3 splits (the rest were smaller and were only assigned one split by default anyway).

Am I wrong in expecting all files on the s3n filesystem to come through as not-splittable? This seems to be a bug in the Hadoop code if I'm right.
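
My understanding, for reference: splittability is decided by the input format from the compression codec, not by the filesystem. A minimal sketch of that check, roughly what stock TextInputFormat does, using a placeholder path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    // Rough sketch: resolve the compression codec for a path by extension.
    // A non-null codec (GzipCodec for .gz) is what should make the input format
    // treat the file as non-splittable, regardless of the underlying filesystem.
    public class CodecCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            Path p = new Path("s3n://bucket/path/file.csv.gz");   // placeholder path
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(p);
            System.out.println(codec == null
                    ? "no codec: splittable"
                    : "codec " + codec.getClass().getSimpleName() + ": should not be split");
        }
    }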

David

From: Raj Vishwanathan [mailto:rajvish@yahoo.com]
Sent: Thursday, December 06, 2012 1:45 PM
To: user@hadoop.apache.org
Subject: Re: Map tasks processing some files multiple times

Could it be due to speculative execution (spec-ex)? Does it make a difference in the end?

Raj


From: David Parks <davidparks21@yahoo.com>
To: user@hadoop.apache.org
Sent: Wednesday, December 5, 2012 10:15 PM
Subject: Map tasks processing some files multiple times

I've got a job that reads in 167 files from S3, but 2 of the files are being mapped twice and 1 of the files is mapped 3 times.

This is the code I use to set up the mapper:

    Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");

    for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
        log.info("Identified linkshare catalog: " + f.getPath().toString());

    if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
        MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
    }

I can see from the logs that it sees only 1 copy of each of these files, and correctly identifies 167 files.

I also have the following confirmation that it found the 167 files correctly:

2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167

When I look through the syslogs I can see that the file in question was opened by two different map attempts:

./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

This is only happening to these 3 files; all the others seem to be fine. For the life of me I can't see a reason why these files might be processed multiple times.

Notably, map attempt 173 implies more map attempts than should be possible. There are 167 input files (from S3, gzipped), so there should be only 167 map tasks. But I see a total of 176 map tasks.

Any thoughts/ideas/guesses?