hadoop-hdfs-issues mailing list archives

From "Marcelo Vanzin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12534) Provide logical BlockLocations for EC files for better split calculation
Date Sat, 23 Sep 2017 01:36:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16177425#comment-16177425 ]

Marcelo Vanzin commented on HDFS-12534:

bq. Are you sure we can split within a single S3 file?

Location != split. You can have x splits, all with the same location. I'm pretty sure reading
from a single S3 file using FileInputFormat generates multiple tasks (one per "split"). You
may want to look at how it does that; it might be all client-side based on some client-side

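To illustrate the "location != split" point, here is a minimal sketch modeled on the split-size rule used by Hadoop's FileInputFormat (split size is max(minSize, min(maxSize, blockSize)), with a 1.1 slop factor on the final split). It is a simplification, not the real implementation: the class SplitSketch, its Split type, and the "localhost" placeholder host are hypothetical names for this example, and the real code also checks splittability and picks hosts per block index.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of client-side split calculation. */
final class SplitSketch {
    // Same slop factor FileInputFormat uses: the last split may be up to 10% oversized.
    static final double SPLIT_SLOP = 1.1;

    /** One input split: byte range plus the (single) location it is scheduled against. */
    static final class Split {
        final long start;
        final long length;
        final String host;

        Split(long start, long length, String host) {
            this.start = start;
            this.length = length;
            this.host = host;
        }
    }

    /** Cuts a file into splits even though every split reports the same host. */
    static List<Split> splits(long fileLength, long blockSize,
                              long minSize, long maxSize, String host) {
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        if (splitSize <= 0) {
            throw new IllegalArgumentException("split size must be positive");
        }
        List<Split> out = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            out.add(new Split(fileLength - remaining, splitSize, host));
            remaining -= splitSize;
        }
        if (remaining != 0) {
            // Tail split: whatever is left, at most SPLIT_SLOP * splitSize bytes.
            out.add(new Split(fileLength - remaining, remaining, host));
        }
        return out;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed values: a 100 MB object whose filesystem reports a 32 MB block size.
        for (Split s : splits(100 * mb, 32 * mb, 1, Long.MAX_VALUE, "localhost")) {
            System.out.println(s.start + " " + s.length + " " + s.host);
        }
    }
}
```

With the assumed 100 MB length and 32 MB block size, this yields four splits (three of 32 MB and a 4 MB tail), all carrying the same location.
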
> Provide logical BlockLocations for EC files for better split calculation
> ------------------------------------------------------------------------
>                 Key: HDFS-12534
>                 URL: https://issues.apache.org/jira/browse/HDFS-12534
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: erasure-coding
>    Affects Versions: 3.0.0-beta1
>            Reporter: Andrew Wang
>              Labels: hdfs-ec-3.0-must-do
> I talked to [~vanzin] and [~alex.behm] some more about split calculation with EC. It
> turns out HDFS-12222 was resolved prematurely. Applications depend on HDFS BlockLocation to
> understand where the split points are. The current scheme of returning one BlockLocation per
> block group loses this information.
> We should change this to provide logical blocks. Divide the file length by the block
> size and provide suitable BlockLocations to match, with virtual offsets and lengths too.
> I'm not marking this as incompatible, since changing it this way would in fact make it
> more compatible from the perspective of applications that are scheduling against replicated
> files. Thus, it'd be good for beta1 if possible, but okay for later too.

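The proposal in the issue description (divide the file length by the block size and return one logical block per piece, with virtual offsets and lengths) can be sketched as follows. This is an illustration under assumptions, not the HDFS-12534 patch: LogicalBlockSketch and LogicalBlock are hypothetical names, and the sketch leaves out the host list that a real BlockLocation would carry.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of logical block calculation; not HDFS code. */
final class LogicalBlockSketch {

    /** A logical block described only by its virtual offset and length in the file. */
    static final class LogicalBlock {
        final long offset;
        final long length;

        LogicalBlock(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Divides a file of fileLength bytes into consecutive logical blocks of
     * blockSize bytes; the final block is shorter if the length is not a multiple.
     */
    static List<LogicalBlock> logicalBlocks(long fileLength, long blockSize) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("need fileLength >= 0 and blockSize > 0");
        }
        List<LogicalBlock> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            long length = Math.min(blockSize, fileLength - offset);
            blocks.add(new LogicalBlock(offset, length));
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed values: a 300 MB file with a 128 MB block size.
        for (LogicalBlock b : logicalBlocks(300 * mb, 128 * mb)) {
            System.out.println(b.offset + " " + b.length);
        }
    }
}
```

For the assumed 300 MB file and 128 MB block size, this produces three logical blocks, the last one 44 MB long. An application scheduling against replicated files would see the same split points it already expects.
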