hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12534) Provide logical BlockLocations for EC files for better split calculation
Date Sat, 23 Sep 2017 01:12:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16177409#comment-16177409 ]

Andrew Wang commented on HDFS-12534:

This ends up being kind of complicated, since we don't have the preferredBlockSize in the
LocatedBlock. We do have it in the FileStatus, but some of the client APIs only return a BlockLocation
and don't query a FileStatus.

The most efficient solution is to add preferredBlockSize to the LocatedBlock proto. We already
have some EC-specific fields for the LocatedStripedBlock subclass. It's hard to plumb this
though, since LocatedBlock is created pretty far down in BlockManager, and the preferredBlockSize
comes from the file in FSNamesystem.

We could also make the client make another RPC to get the FileStatus for EC files. This would
be for the APIs that take a path and return a BlockLocation, since the LocatedFileStatus APIs
already have a FileStatus. This comes at a performance cost.

I lean toward the efficient option. I didn't have time to plumb preferredBlockSize into the
LocatedBlock today. I'm going to unassign myself for now in case [~HuafengWang] or someone
else would like to pick this up.

Sidenote for [~vanzin], I checked S3AFileSystem and it looks like we just return a single
location per file (the dummy FileSystem implementation), which [~fabbri] confirmed. Are you
sure we can split within a single S3 file?

> Provide logical BlockLocations for EC files for better split calculation
> ------------------------------------------------------------------------
>                 Key: HDFS-12534
>                 URL: https://issues.apache.org/jira/browse/HDFS-12534
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: erasure-coding
>    Affects Versions: 3.0.0-beta1
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>              Labels: hdfs-ec-3.0-must-do
>
> I talked to [~vanzin] and [~alex.behm] some more about split calculation with EC. It
> turns out HDFS-12222 was resolved prematurely. Applications depend on HDFS BlockLocation to
> understand where the split points are. The current scheme of returning one BlockLocation per
> block group loses this information.
> We should change this to provide logical blocks. Divide the file length by the block
> size and provide suitable BlockLocations to match, with virtual offsets and lengths too.
> I'm not marking this as incompatible, since changing it this way would in fact make it
> more compatible from the perspective of applications that are scheduling against replicated
> files. Thus, it'd be good for beta1 if possible, but okay for later too.
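
The "divide the file length by the block size" idea in the description can be sketched roughly as below. This is only an illustration under assumptions, not the patch for this issue: `LogicalBlocks`, `Range`, and `split` are hypothetical names, and a real BlockLocation would also carry hosts and storage information, which this sketch leaves out. It also shows why the preferredBlockSize is needed as an input in the first place.

```java
import java.util.ArrayList;
import java.util.List;

public class LogicalBlocks {

    /** Hypothetical stand-in for a BlockLocation: just a virtual offset and length. */
    static final class Range {
        final long offset;
        final long length;

        Range(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Cuts a file of fileLength bytes into logical blocks of at most
     * blockSize bytes. The last block holds whatever remains.
     */
    static List<Range> split(long fileLength, long blockSize) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("need fileLength >= 0 and blockSize > 0");
        }
        List<Range> ranges = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            long length = Math.min(blockSize, fileLength - offset);
            ranges.add(new Range(offset, length));
        }
        return ranges;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        // Assumed example values: a 300 MiB file with a 128 MiB preferred block size.
        for (Range r : split(300 * mib, 128 * mib)) {
            System.out.println("offset=" + r.offset / mib + " MiB, length=" + r.length / mib + " MiB");
        }
    }
}
```

With these logical ranges, an application that schedules one split per BlockLocation would see the same split points for an EC file as it would for a replicated file of the same length and block size.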
