hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9040) Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests to Coordinator)
Date Sat, 19 Sep 2015 01:44:04 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876807#comment-14876807 ]

Jing Zhao commented on HDFS-9040:
---------------------------------

Good point about the GS bump! Some thoughts along this line:
# First, I totally agree: if the client finally commits the block, then based on the committed
length we can identify the replicas reported from failed DNs and exclude them (assuming checksum
is enabled). No GS bump is needed in this case.
# If the client fails before committing the block, the NN does not know the expected length.
To recover the lease, the NN may have to contact all the DataNodes and identify the "safe
length" of the block group (a sketch of this computation is given after this list). Because
our current implementation (with GS bump) does not guarantee that an internal block with a
higher GS has a longer safe length (since a healthy streamer may be very slow and left far
behind), the GS bump is kind of useless here.
# Based on the above two points, we have the first option: do not bump the GS for failures
during writing; we only need to bump the GS for append in the future.
# However, this option provides no lower bound on the safe length. When a failure happens and
is detected by the StripedOutputStream, nothing is done to make sure some amount of data has
been successfully written until the block gets committed. It is very possible that when the
user finally hits a write failure, only a little data can be recovered.
# Another issue is that when the NN restarts and receives block reports from the DNs, it is
hard for it to determine when to start the recovery. It may determine the safe length too
early (e.g., based on only 6 of the 9 reported internal blocks) and truncate too much data.
Thus the recovery logic based on block lengths can be complicated.
# We can have option 2: sync/flush the data periodically. Similarly to QFS, we can flush the
data out every 1MB or every n stripes, or we can choose to flush the data only when failures
are detected. It will definitely be slower, but it is safer, and some optimizations can be
done: e.g., we can rule out slow streamers if we still have enough healthy streamers. The GS
bump can then come into the picture to help us simplify the recovery: we can guarantee that
a new GS indicates some level of safe length (since we flush the data to the DNs before the
GS bump), and when the NN does the recovery later, the GS can help it determine which
DataNodes should be included in the recovery process (a sketch of this flush-then-bump
ordering is also given after this list).
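
Regarding the "safe length" in points 2 and 5 above, here is a minimal sketch of how it could be computed from the internal block lengths reported by the DataNodes. This is not the actual HDFS code: the class, the cell size, and the bounds check are illustrative, and it assumes an RS(6,3) schema where any 6 of the 9 internal blocks can reconstruct a full stripe.
{code:java}
import java.util.Arrays;

public class SafeLengthSketch {
  // RS(6,3): 6 data blocks + 3 parity blocks per block group.
  static final int DATA_BLOCKS = 6;
  // Hypothetical cell size; the real value comes from the EC schema/config.
  static final long CELL_SIZE = 64 * 1024;

  /**
   * A conservative "safe length" (bytes of user data) for a block group:
   * the largest whole number of stripes that at least DATA_BLOCKS reported
   * internal blocks fully cover, so the data up to that point can always
   * be reconstructed.
   */
  static long getSafeLength(long[] reportedInternalBlockLengths) {
    if (reportedInternalBlockLengths.length < DATA_BLOCKS) {
      throw new IllegalStateException(
          "need reports from at least " + DATA_BLOCKS + " internal blocks");
    }
    long[] lens = reportedInternalBlockLengths.clone();
    Arrays.sort(lens); // ascending
    // The DATA_BLOCKS-th largest length: at least 6 internal blocks are
    // at least this long, so stripes up to this point are recoverable.
    long bound = lens[lens.length - DATA_BLOCKS];
    long fullCells = bound / CELL_SIZE;          // complete cells per internal block
    return fullCells * CELL_SIZE * DATA_BLOCKS;  // user data in complete stripes
  }
}
{code}
This also shows why option 1 alone is uncomfortable: the result depends entirely on which internal blocks have reported and how long they happen to be, with no floor guaranteed by the write protocol.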
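
And for option 2, a rough sketch of the flush-then-bump ordering at the coordinator level. The types and method names here are hypothetical, not the actual DFSStripedOutputStream API; the only point is the ordering: flush the healthy streamers and wait for their acks first, and only then do the updatePipeline-style NN RPC, so that carrying the new GS implies a known floor on an internal block's length.
{code:java}
import java.io.IOException;
import java.util.List;

public class FlushThenBumpSketch {

  /** Minimal view of a per-internal-block streamer (hypothetical). */
  interface Streamer {
    void flushAndWaitForAcks() throws IOException; // drain queue, wait for DN acks
    void setGenerationStamp(long gs);              // adopt the new GS
  }

  /** Minimal view of the coordinator that owns the NN RPCs (hypothetical). */
  interface Coordinator {
    List<Streamer> healthyStreamers();
    long bumpGenerationStamp() throws IOException; // updatePipeline-style NN RPC
  }

  static void onStreamerFailure(Coordinator c) throws IOException {
    // 1. Flush the healthy streamers first; after this their internal blocks
    //    hold everything the client has handed out so far.
    for (Streamer s : c.healthyStreamers()) {
      s.flushAndWaitForAcks();
    }
    // 2. Only then bump the GS at the NN. Failed streamers keep the old GS,
    //    so during lease recovery the NN can exclude stale-GS internal blocks
    //    and still know the data up to the flush point is safe.
    long newGS = c.bumpGenerationStamp();
    for (Streamer s : c.healthyStreamers()) {
      s.setGenerationStamp(newGS);
    }
  }
}
{code}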

Thoughts?

> Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests to Coordinator)
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9040
>                 URL: https://issues.apache.org/jira/browse/HDFS-9040
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Walter Su
>         Attachments: HDFS-9040-HDFS-7285.002.patch, HDFS-9040-HDFS-7285.003.patch, HDFS-9040.00.patch, HDFS-9040.001.wip.patch, HDFS-9040.02.bgstreamer.patch
>
>
> The general idea is to simplify error handling logic.
> Proposal 1:
> A BlockGroupDataStreamer to communicate with the NN to allocate/update blocks, and StripedDataStreamers only have to stream blocks to DNs.
> Proposal 2:
> See below the [comment|https://issues.apache.org/jira/browse/HDFS-9040?focusedCommentId=14741388&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14741388]
from [~jingzhao].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
