hadoop-common-issues mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15320) Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
Date Mon, 19 Mar 2018 23:27:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405583#comment-16405583 ]

Chris Douglas commented on HADOOP-15320:

What testing has already been done with this?

bq. I do think it will need to be bounced past the various tools, including Hive, Spark, and
Pig, to see that it all goes OK. But given S3A is using that default with no adverse consequences,
I think you'll be right.
Wouldn't one expect the same results if the pattern worked for S3A? One would expect to find
framework code that is unnecessarily serial after this change. What tests did S3A run that
should be repeated?

bq. against which endpoints did you run the entire hadoop-azure and hadoop-azure-datalake test suites?
Running these integration tests is a good idea. It's why they're there, after all.

> Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
> ----------------------------------------------------------------------------------
>                 Key: HADOOP-15320
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15320
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/adl, fs/azure
>    Affects Versions: 2.7.3, 2.9.0, 3.0.0
>            Reporter: shanyu zhao
>            Assignee: shanyu zhao
>            Priority: Major
>         Attachments: HADOOP-15320.patch
> hadoop-azure and hadoop-azure-datalake have their own implementations of getFileBlockLocations(),
which fake a list of artificial blocks based on a hard-coded block size, with each block
reporting a single host named "localhost". Take a look at this code:
> [https://github.com/apache/hadoop/blob/release-2.9.0-RC3/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeAzureFileSystem.java#L3485]
> This is an unnecessary mock-up of a "remote" file system to mimic HDFS. The problem
with this mock is that for large (~TB) files it generates lots of artificial blocks, and FileInputFormat.getSplits()
is slow in calculating splits based on these blocks.
> We can safely remove this customized getFileBlockLocations() implementation and fall back
to the default FileSystem.getFileBlockLocations() implementation, which returns 1 block
for any file, with 1 host "localhost". Note that this doesn't mean we will create far fewer
splits, because the number of splits is still limited by the blockSize in FileInputFormat.computeSplitSize():
> {code:java}
> return Math.max(minSize, Math.min(goalSize, blockSize));{code}
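To see why a single whole-file block does not collapse the split count, here is a standalone sketch of the computeSplitSize() formula quoted above. The formula matches the JIRA text; the file size, block size, and goal-size numbers are hypothetical examples, not values from the patch.

```java
// Sketch of how FileInputFormat.computeSplitSize() bounds the split size.
// The formula is the one quoted in the issue; the sample sizes below are
// illustrative assumptions.
public class SplitSizeDemo {

    // Same expression as FileInputFormat.computeSplitSize()
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long fileSize = 1L << 40;        // a ~1 TB file
        long blockSize = 128L << 20;     // 128 MB configured block size
        long minSize = 1L;               // default mapreduce min split size
        long goalSize = fileSize / 100;  // hypothetical totalSize/numSplits hint

        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        long numSplits = (fileSize + splitSize - 1) / splitSize;

        // Even if getFileBlockLocations() reports one "localhost" block for
        // the whole file, the split size is still capped at blockSize, so
        // this 1 TB file still produces 8192 splits of 128 MB each.
        System.out.println(splitSize + " " + numSplits);
    }
}
```

So with the default getFileBlockLocations(), split sizing is unchanged; only the per-block bookkeeping that getSplits() must walk through shrinks.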

This message was sent by Atlassian JIRA

