hadoop-common-dev mailing list archives

From "Erik Forsberg (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-6290) AutoInputFormat + (larger) bzip2 files cause multiple runs over same file
Date Tue, 29 Sep 2009 09:12:15 GMT
AutoInputFormat + (larger) bzip2 files cause multiple runs over same file
-------------------------------------------------------------------------

                 Key: HADOOP-6290
                 URL: https://issues.apache.org/jira/browse/HADOOP-6290
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 0.18.3
            Reporter: Erik Forsberg


When running a streaming job with -inputformat org.apache.hadoop.streaming.AutoInputFormat
on the streaming command line, and the input directory contains a few .bzip2 files of
roughly 110MiB each (compressed), each file is processed twice, i.e., if there are two
bzip2 files in the directory, four mappers are run.

When running a wordcount M/R job, the resulting counts are doubled, which indicates that
each input file is analysed twice.

This was discovered while trying out dumbo, which uses AutoInputFormat by default. See http://groups.google.com/group/dumbo-user/browse_frm/thread/84b04b2320d4bbb0?hl=en

It seems this can't be reproduced on small files. It is possible the file has to be larger
than the DFS blocksize, in my case set to 64MiB.
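
To illustrate why the block size is a plausible factor, here is a rough sketch of the arithmetic. It assumes a simplified model in which a file gets one split (and so one mapper) per DFS block it spans; the function names are hypothetical and this is not Hadoop's actual split code, which has further details. Under that assumption, the sizes above give the four mappers I observed.

```python
import math

MIB = 1024 * 1024


def expected_splits(file_size_bytes, block_size_bytes):
    """Simplified model (an assumption): one split per block spanned, at least one."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))


def expected_mappers(file_sizes, block_size_bytes):
    """Total mappers if every split is handed to its own mapper."""
    return sum(expected_splits(size, block_size_bytes) for size in file_sizes)


# Figures from this report: two bzip2 files of roughly 110MiB, 64MiB blocks.
files = [110 * MIB, 110 * MIB]
block = 64 * MIB

print(expected_mappers(files, block))  # 4
```

If each of those mappers ends up decompressing the whole file rather than only its own split, every record would be counted once per split, which would match the doubled wordcount. That is only a guess at the mechanism, not something I have confirmed.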

I'm using Cloudera's hadoop distribution, version 0.18.3-6cloudera0.3.0~intrepid.

Please let me know if I need to provide further details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

