avro-dev mailing list archives

From "Dave Beech (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1234) Avro MapReduce jobs silently ignore input data without '.avro' extension
Date Thu, 17 Jan 2013 22:46:12 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556685#comment-13556685 ]

Dave Beech commented on AVRO-1234:
----------------------------------

Well, the fact the old and new APIs differ in their behaviour is clearly not ideal. But I
consider this a bugfix that hasn't been backported rather than a regression :)

If I had a directory containing a mixture of Avro and non-Avro files, and I gave that path
to AvroInputFormat to process, I'd fully expect the job to die a horrible death. Silently
discarding input feels wrong, especially so if it's valid Avro which just happens to be named
a certain way. A magic bytes check would be better, but on the whole I'm just not sure a check
is necessary. As far as I'm aware, none of the standard Hadoop input formats behave in this
way (happy to be corrected as I haven't checked them all!).
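
For illustration only, here is a minimal sketch of what a magic bytes check could look like. It relies on the fact that an Avro object container file starts with the four bytes 'O', 'b', 'j', 0x01. The function name has_avro_magic is hypothetical and is not part of Avro or of the attached patch; the sketch works on any binary stream rather than on Hadoop paths.

```python
import io

# An Avro object container file begins with these four bytes: 'O', 'b', 'j', 0x01.
AVRO_MAGIC = b"Obj\x01"


def has_avro_magic(stream):
    """Return True if the binary stream starts with the Avro container magic.

    Hypothetical helper for illustration; reads at most four bytes.
    """
    return stream.read(len(AVRO_MAGIC)) == AVRO_MAGIC


# Usage: the decision depends on content, not on the file name.
avro_like = io.BytesIO(AVRO_MAGIC + b"...header and blocks would follow...")
plain_text = io.BytesIO(b"just some text in a file named data.avro")

print(has_avro_magic(avro_like))   # True
print(has_avro_magic(plain_text))  # False
```

A check like this would accept valid Avro data whatever it is called, and would let a job reject a non-Avro file loudly instead of skipping it.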

 
                
> Avro MapReduce jobs silently ignore input data without '.avro' extension
> ------------------------------------------------------------------------
>
>                 Key: AVRO-1234
>                 URL: https://issues.apache.org/jira/browse/AVRO-1234
>             Project: Avro
>          Issue Type: Bug
>    Affects Versions: 1.7.3
>            Reporter: Dave Beech
>            Assignee: Dave Beech
>         Attachments: AVRO-1234.patch
>
>
> The AvroInputFormat class explicitly checks each input path for a '.avro' extension.
> If only some of the input paths have the correct extension, the remainder are silently
> ignored and not included in the job. However, if none of the input paths have the extension,
> the job will continue and succeed even though no map tasks are allocated, and no work is done.
> This only happens using the old mapred API. The new mapreduce API version will happily
> read files regardless of extension.
> Is the check necessary?
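
The difference described in the report can be modelled with a small sketch. This is a simplified model of the behaviour as reported, not the Avro source code; the function names old_api_select and new_api_select and the example paths are made up for illustration.

```python
AVRO_EXT = ".avro"


def old_api_select(paths):
    """Model of the reported old (mapred) API behaviour:
    keep only paths whose name ends in '.avro', dropping the rest silently."""
    return [p for p in paths if p.endswith(AVRO_EXT)]


def new_api_select(paths):
    """Model of the reported new (mapreduce) API behaviour:
    every input path is read, whatever its extension."""
    return list(paths)


mixed = ["events/part-00000.avro", "events/part-00001"]
unnamed = ["events/part-00001", "events/part-00002"]

print(old_api_select(mixed))    # ['events/part-00000.avro']
print(new_api_select(mixed))    # ['events/part-00000.avro', 'events/part-00001']
print(old_api_select(unnamed))  # [] -> no inputs, so no map tasks and no error
```

The last line is the case the report highlights: an empty selection is not treated as a failure, so the job finishes successfully without doing any work.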

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
