hadoop-hdfs-issues mailing list archives

From "Huafeng Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12222) Add EC information to BlockLocation
Date Thu, 10 Aug 2017 08:01:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121223#comment-16121223 ]

Huafeng Wang commented on HDFS-12222:
-------------------------------------

I've checked the related code, and it is not easy to provide separate functions that return only
the parity blocks or only the data blocks.
The problem is that LocatedFileStatus is a subclass of FileStatus, and both live in the hadoop-common
module, which has no information about a file's erasure coding policy. Without that policy
information, LocatedFileStatus cannot tell which BlockLocation is actually a parity block.

After discussing this with Kai offline, one approach is to add an ECSchema to LocatedFileStatus
so that we can determine which blocks are parity blocks when erasure coding is enabled.
Any suggestions? Thanks.
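
To illustrate the kind of check a schema would make possible, here is a minimal sketch. It is not
Hadoop code: EcLayoutSketch, Schema and isParity are hypothetical names, and the sketch assumes that
within one block group the data units are listed first and the parity units follow. Whether
BlockLocation entries are actually ordered that way would need to be confirmed against the HDFS code.

```java
// Hypothetical sketch, not Hadoop code. Assumption: within one block group,
// the data units come first and the parity units follow.
final class EcLayoutSketch {

    /** Minimal stand-in for the unit counts carried by an EC schema. */
    static final class Schema {
        final int numDataUnits;
        final int numParityUnits;

        Schema(int numDataUnits, int numParityUnits) {
            if (numDataUnits <= 0 || numParityUnits < 0) {
                throw new IllegalArgumentException("invalid unit counts");
            }
            this.numDataUnits = numDataUnits;
            this.numParityUnits = numParityUnits;
        }

        /** Total number of units (data plus parity) in one block group. */
        int groupSize() {
            return numDataUnits + numParityUnits;
        }
    }

    /** Returns true if the unit at indexInGroup is parity under the assumed ordering. */
    static boolean isParity(Schema schema, int indexInGroup) {
        if (indexInGroup < 0 || indexInGroup >= schema.groupSize()) {
            throw new IndexOutOfBoundsException("index " + indexInGroup);
        }
        return indexInGroup >= schema.numDataUnits;
    }

    public static void main(String[] args) {
        Schema rs63 = new Schema(6, 3); // e.g. a 6 data + 3 parity layout
        for (int i = 0; i < rs63.groupSize(); i++) {
            System.out.println("unit " + i + (isParity(rs63, i) ? " parity" : " data"));
        }
    }
}
```

The point of the sketch is only that the two unit counts are enough to classify an entry by its
position; without them, as described above, LocatedFileStatus has nothing to decide with.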

> Add EC information to BlockLocation
> -----------------------------------
>
>                 Key: HDFS-12222
>                 URL: https://issues.apache.org/jira/browse/HDFS-12222
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Andrew Wang
>            Assignee: Huafeng Wang
>              Labels: hdfs-ec-3.0-nice-to-have
>
> HDFS applications query block location information to compute splits. One example of
> this is FileInputFormat:
> https://github.com/apache/hadoop/blob/d4015f8628dd973c7433639451a9acc3e741d2a2/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L346
> You see bits of code like this that calculate offsets as follows:
> {noformat}
>     long bytesInThisBlock = blkLocations[startIndex].getOffset() + 
>                           blkLocations[startIndex].getLength() - offset;
> {noformat}
> EC confuses this since the block locations include parity block locations as well, which
> are not part of the logical file length. This messes up the offset calculation and thus
> topology/caching information too.
> Applications can figure out what's a parity block by reading the EC policy and then parsing
> the schema, but it'd be a lot better if we exposed this more generically in BlockLocation
> instead.
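
The offset calculation quoted above can be worked through on its own. The following is a minimal
sketch of just that arithmetic, using plain long values in place of BlockLocation objects. The
class and method names are hypothetical, and the range check is an addition that is not part of
the quoted snippet.

```java
// Hypothetical sketch of the quoted arithmetic; not taken from FileInputFormat.
final class SplitOffsetSketch {

    /**
     * Bytes remaining in a block from a given file offset to the end of the block:
     * blockOffset + blockLength - offset, as in the quoted snippet.
     */
    static long bytesInThisBlock(long blockOffset, long blockLength, long offset) {
        // Added check (an assumption of this sketch): the offset must fall inside the block.
        if (offset < blockOffset || offset >= blockOffset + blockLength) {
            throw new IllegalArgumentException("offset " + offset + " is outside the block");
        }
        return blockOffset + blockLength - offset;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A block covering [128 MB, 256 MB) and a split starting at 130 MB.
        long remaining = bytesInThisBlock(128 * mb, 128 * mb, 130 * mb);
        System.out.println(remaining / mb + " MB remain in this block");
    }
}
```

The formula only makes sense when the entry at startIndex covers the logical file offset being
asked about, which is why entries that are not part of the logical file length disturb it.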



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

