hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7678) Erasure coding: DFSInputStream with decode functionality
Date Sat, 02 May 2015 00:12:06 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524353#comment-14524353 ]

Andrew Wang commented on HDFS-7678:
-----------------------------------

Nits / comment requests:

* extra imports in DFSStripedInputStream
* I noticed these config keys aren't present in hdfs-default.xml, which is where we document
the behavior of config keys. I bring it up because I was thinking about the semantic difference
between an "overall timeout" and a "per-op timeout". Since there are a couple of EC keys that
need documentation, I'm okay punting this to a quick follow-on.
* These same keys could also use javadoc on the DFSClientConf getters, so we don't need to
hunt through hdfs-default.xml or read how the code uses the configs (a sketch follows at the
end of this section).
* The importance of the largest read portion is not explained, nor is there a high-level
comment about the overall flow of the recovery logic. Some ASCII art would help clarify all
this; I drew something like this:

{noformat}
      +--+  +--+  |  +--+  +--+
+--+  |  |  |  |  |  |  |  |  |
|  |  |  |  +--+  |  |  |  |  |
+--+  +--+        |  +--+  +--+
                  |
 d1    d2    d3   |   p1    p2
{noformat}

This way it becomes obvious that if d1 fails, you need to lengthen the read to d3 in addition
to grabbing the parity. The TODO to optimize the read is also clarified: we always calculate
the max span and always read max spans for recovery, even though we might only need the part
of d3 that's unfetched.

Relatedly, printing the "plan" like this could be nice for debugging, and a nice illustration
for the test cases (see the second sketch below).
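
On the javadoc point above, here's a minimal sketch of what a documented DFSClientConf getter
could look like. The field, key name, and semantics below are all hypothetical, since the
actual EC config keys aren't quoted in this thread:

{code:java}
// Sketch only: the field and the key name in the javadoc are invented,
// not the patch's actual config keys.
public class DFSClientConfSketch {
  private final long stripedReadOverallTimeoutMs;

  public DFSClientConfSketch(long stripedReadOverallTimeoutMs) {
    this.stripedReadOverallTimeoutMs = stripedReadOverallTimeoutMs;
  }

  /**
   * Overall timeout in milliseconds for an entire striped read, including
   * any recovery reads (hypothetical key:
   * "dfs.client.ec.read.overall.timeout.ms"). This bounds the whole
   * operation; a per-operation timeout would bound each individual
   * datanode read. Keep this text in sync with hdfs-default.xml.
   */
  public long getStripedReadOverallTimeoutMs() {
    return stripedReadOverallTimeoutMs;
  }
}
{code}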
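And a rough sketch of the max-span calculation plus a printable "plan", purely for
illustration; the data layout and every name here are invented, not the patch's:

{code:java}
// Illustration only: a hypothetical one-stripe "plan" showing the max-span
// calculation described above. None of these names come from the patch.
public class StripedReadPlanSketch {
  public static void main(String[] args) {
    String[] names = { "d1", "d2", "d3", "p1", "p2" };
    long cell = 64;                          // bytes per cell in this stripe
    long[] fetched = { -1, 64, 32, 0, 0 };   // bytes already read; -1 = failed

    // Max span: the widest [start, cell) range any surviving block still needs.
    long spanStart = cell;
    for (long f : fetched) {
      if (f >= 0 && f < cell) {
        spanStart = Math.min(spanStart, f);
      }
    }
    System.out.println("recovery span: [" + spanStart + ", " + cell + ")");

    // Print the plan: what we have, what we would read, what failed.
    for (int i = 0; i < names.length; i++) {
      String action;
      if (fetched[i] < 0) {
        action = "FAILED, decode from the others";
      } else if (fetched[i] >= cell) {
        action = "complete";
      } else {
        action = "read [" + spanStart + ", " + cell + ")";
        if (fetched[i] > spanStart) {
          action += "  (re-reads " + (fetched[i] - spanStart)
              + " bytes we already have)";
        }
      }
      System.out.printf("%s  fetched=[0, %d)  %s%n",
          names[i], Math.max(fetched[i], 0), action);
    }
  }
}
{code}

Note how d3 re-reads the 32 bytes it already has: that's exactly the inefficiency the TODO
is about.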

Code:
* AFAICT we don't respect deadNodes in scheduleOneStripedRead, or during the initial planning
phase. I mentioned this a bit in person, but we also need to think about the dead-node marking
policy for EC reads, since the timeout is aggressive.
* I don't understand the overall timeout handling: even when the overall timeout has elapsed,
we still go through the switch and schedule recovery reads. This seems like something we should
unit test (see the sketch after this list).
* We also probably still want a per-operation timeout in addition to the overall timeout.
If the per-op timeout hits, we keep trying to recover until the overall timeout runs out
(or we run out of blocks).
* The InterruptedException catch should probably rethrow an InterruptedIOException or similar;
the idea is to get us out of this function ASAP while cleaning up state.
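
To make the timeout discussion concrete, here's a sketch of one way the overall deadline,
the per-op timeout, and interrupt handling could compose. The method and field names
(perOpTimeoutMs, scheduleRecoveryReads, etc.) are invented, not the patch's:

{code:java}
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class StripedReadLoopSketch {
  private final CompletionService<Void> readService;
  private final long overallTimeoutMs;
  private final long perOpTimeoutMs;

  StripedReadLoopSketch(CompletionService<Void> readService,
      long overallTimeoutMs, long perOpTimeoutMs) {
    this.readService = readService;
    this.overallTimeoutMs = overallTimeoutMs;
    this.perOpTimeoutMs = perOpTimeoutMs;
  }

  void readStripe(int outstanding) throws IOException {
    long deadline = System.currentTimeMillis() + overallTimeoutMs;
    while (outstanding > 0) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        // Overall timeout elapsed: fail here rather than falling through
        // to schedule yet more recovery reads.
        throw new IOException("striped read timed out");
      }
      try {
        // Per-op timeout, capped so we never wait past the overall deadline.
        Future<Void> done = readService.poll(
            Math.min(perOpTimeoutMs, remaining), TimeUnit.MILLISECONDS);
        if (done == null) {
          // Per-op timeout hit: keep trying to recover from other sources.
          outstanding += scheduleRecoveryReads();
          continue;
        }
        done.get();                // surfaces a failed read as an exception
        outstanding--;
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        // Rethrow as an IOException subtype so we unwind ASAP.
        throw new InterruptedIOException("interrupted during striped read");
      } catch (ExecutionException e) {
        outstanding--;
        outstanding += scheduleRecoveryReads();  // a source failed: recover
      }
    }
  }

  private int scheduleRecoveryReads() {
    return 0;  // placeholder: would submit reads to parity / other sources
  }
}
{code}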

Recommended follow-on work:
* fetchBlockByteRange is an outsized function and could use some splitting.
* There's no abstraction here for cells or stripes; we're mucking directly with internal blocks
and byte ranges. IMO all the logic should happen on these higher-level abstractions, and then
we turn them into blocks and byte ranges afterward. Max-span, for instance, is a hack: we should
be calculating recovery reads cell-by-cell based on what we already have, then combining the
cells for an internal block together to form the actual read (see the first sketch after this
list).
* One thought regarding the threadpool: what we really want is to limit the concurrency per DN,
i.e. a threadpool per DN. This works nicely with the read combining mentioned above, and goes
one better, since it can coalesce recovery reads with already-queued work. This is like how
disk IO queues work (see the second sketch after this list).
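
For the cell abstraction point, a sketch of what cell-by-cell planning plus read combining
could look like; CellRead and combine are invented names, not the patch's API:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class CellPlanSketch {
  /** A byte range still needed from one internal block of the stripe. */
  static final class CellRead {
    final int blockIndex;
    final long offset;
    final long length;
    CellRead(int blockIndex, long offset, long length) {
      this.blockIndex = blockIndex;
      this.offset = offset;
      this.length = length;
    }
  }

  /** Merge cells that are contiguous on the same internal block into one
   *  actual read. Assumes the input is sorted by (blockIndex, offset). */
  static List<CellRead> combine(List<CellRead> cells) {
    List<CellRead> reads = new ArrayList<CellRead>();
    CellRead cur = null;
    for (CellRead c : cells) {
      if (cur != null && cur.blockIndex == c.blockIndex
          && cur.offset + cur.length == c.offset) {
        cur = new CellRead(cur.blockIndex, cur.offset, cur.length + c.length);
      } else {
        if (cur != null) reads.add(cur);
        cur = c;
      }
    }
    if (cur != null) reads.add(cur);
    return reads;
  }
}
{code}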
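And for the threadpool idea, a sketch of per-DN queues; again, all names are invented:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerDatanodeQueuesSketch {
  private final Map<String, ExecutorService> queues =
      new ConcurrentHashMap<String, ExecutorService>();

  /** One single-threaded queue per DN: at most one outstanding read each,
   *  and newly scheduled recovery reads line up behind (or could be merged
   *  with) work already queued for that DN, like a disk IO queue. */
  ExecutorService queueFor(String datanodeId) {
    return queues.computeIfAbsent(datanodeId,
        id -> Executors.newSingleThreadExecutor());
  }

  void submitRead(String datanodeId, Runnable read) {
    queueFor(datanodeId).execute(read);
  }
}
{code}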

> Erasure coding: DFSInputStream with decode functionality
> --------------------------------------------------------
>
>                 Key: HDFS-7678
>                 URL: https://issues.apache.org/jira/browse/HDFS-7678
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7285
>            Reporter: Li Bo
>            Assignee: Zhe Zhang
>         Attachments: BlockGroupReader.patch, HDFS-7678-HDFS-7285.002.patch, HDFS-7678-HDFS-7285.003.patch,
> HDFS-7678-HDFS-7285.004.patch, HDFS-7678-HDFS-7285.005.patch, HDFS-7678-HDFS-7285.006.patch,
> HDFS-7678.000.patch, HDFS-7678.001.patch
>
>
> A block group reader will read data from a BlockGroup, whether it is in striped or contiguous
> layout. The corrupt blocks may be known before reading (reported by the namenode), or may be
> discovered during reading. The block group reader needs to do decoding work when some blocks
> are found to be corrupt.



