hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-14943) S3A to implement getFileBlockLocations() for mapred partitioning
Date Wed, 11 Oct 2017 14:28:00 GMT
Steve Loughran created HADOOP-14943:

             Summary: S3A to implement getFileBlockLocations() for mapred partitioning
                 Key: HADOOP-14943
                 URL: https://issues.apache.org/jira/browse/HADOOP-14943
             Project: Hadoop Common
          Issue Type: Sub-task
          Components: fs/s3
    Affects Versions: 2.8.1
            Reporter: Steve Loughran
            Priority: Critical

It looks suspiciously like S3A isn't providing the partitioning data in {{listLocatedStatus}}
and {{getFileBlockLocations()}} needed to break up a file by the blocksize. This will stop
tools using the MRv1 APIs from doing the partitioning properly if the input format isn't doing
its own split logic.

FileInputFormat in MRv2 is a bit more configurable about input split calculation and will
split up large files. Otherwise, the partitioning is driven more by the default values
of the executing engine than by any config data from the filesystem about what its "block
size" is.
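
For illustration, here is a minimal standalone sketch of the split calculation as I understand it from MRv2's FileInputFormat: the split size is the filesystem's block size clamped between a configured minimum and maximum, and the file is cut into splits of that size with a 10% slop on the last one. It has no Hadoop dependencies; the class name {{SplitSizeSketch}} and the {{splits()}} helper are hypothetical, and the real code also deals with compression, hosts and more.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, dependency-free sketch; not the actual Hadoop class.
final class SplitSizeSketch {
    // The final split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Block size clamped to the [minSize, maxSize] range.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {offset, length} pairs covering the whole file.
    static List<long[]> splits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> result = new ArrayList<>();
        long remaining = fileLength;
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            result.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            result.add(new long[] {fileLength - remaining, remaining});
        }
        return result;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed values: a 300 MB file on a store reporting a 128 MB block size.
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        for (long[] s : splits(300 * mb, splitSize)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

The point of the sketch is that the block size reported by the filesystem feeds straight into the split size, so a store that reports nothing useful leaves the outcome to the min/max defaults.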

NativeAzureFS does a better job; maybe that could be factored out to hadoop-common and reused?
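
As a rough sketch of what a reusable helper might look like, the code below synthesizes block locations for a store that has no real blocks: it cuts the requested byte range into blocksize-aligned pieces and reports one placeholder host for each. This is not the NativeAzureFS code; {{SyntheticBlockLocations}}, {{Block}} and {{locate()}} are hypothetical names, and a real version would return Hadoop {{BlockLocation}} objects instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, dependency-free sketch of synthesizing block locations.
final class SyntheticBlockLocations {

    // Stand-in for Hadoop's BlockLocation: offset, length and a single host.
    static final class Block {
        final long offset;
        final long length;
        final String host;

        Block(long offset, long length, String host) {
            this.offset = offset;
            this.length = length;
            this.host = host;
        }
    }

    // Returns every blocksize-aligned block overlapping [start, start + len).
    static List<Block> locate(long fileLength, long blockSize,
                              long start, long len, String host) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Block> blocks = new ArrayList<>();
        if (fileLength <= 0 || len <= 0 || start < 0 || start >= fileLength) {
            return blocks; // nothing to report
        }
        // Clamp the end of the range to the file length, avoiding overflow.
        long end = (len > fileLength - start) ? fileLength : start + len;
        // Align the first block down to a block boundary.
        long offset = (start / blockSize) * blockSize;
        while (offset < end) {
            long length = Math.min(blockSize, fileLength - offset);
            blocks.add(new Block(offset, length, host));
            offset += length;
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Assumed values: 300-byte file, 128-byte blocks, placeholder host.
        for (Block b : locate(300L, 128L, 0L, 300L, "localhost")) {
            System.out.println(b.host + " offset=" + b.offset + " length=" + b.length);
        }
    }
}
```

The host is a placeholder because an object store has no data locality to report; only the offsets and lengths carry information for the split calculation.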

