hadoop-common-issues mailing list archives

From "Kai Zheng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11847) Enhance raw coder allowing to read least required inputs in decoding
Date Fri, 22 May 2015 02:42:17 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555483#comment-14555483 ]

Kai Zheng commented on HADOOP-11847:

Thanks for the further review and comments!
bq. for findFirstValidInput, still one comment not addressed:
Sorry, I forgot to explain why the code is written that way. The thinking was that the first
unit is rarely the erased one, so in most cases just checking {{inputs\[0\]}} returns the
wanted result without entering the loop.
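
A minimal sketch of that fast path, assuming erased or unread units are passed as null entries. The class name, the generic signature and the exception message are illustrative and not taken from the patch:

```java
final class FirstValidInputSketch {
    // Hypothetical helper: returns the first non-null input.
    static <T> T findFirstValidInput(T[] inputs) {
        // Fast path: the first unit is rarely erased, so check it before looping.
        if (inputs.length > 0 && inputs[0] != null) {
            return inputs[0];
        }
        // Slow path: scan for the first unit that is present.
        for (T input : inputs) {
            if (input != null) {
                return input;
            }
        }
        throw new IllegalArgumentException("All inputs are null");
    }
}
```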
bq. Do we need maxInvalidUnits * 2 for bytesArrayBuffers and directBuffers? Since we don't
need additional buffer for inputs. The correct size should be ...
Good catch! How about simply having {{maxInvalidUnits = numParityUnits}}? The benefit is that
we don't have to re-allocate the shared buffers for different erasures.
bq. The share buffer size should be always the chunk size, otherwise they can't be shared,
since the dataLen may be different.
We don't have or use chunkSize now. Please note the check is:
+    if (bytesArrayBuffers == null || bytesArrayBuffers[0].length < dataLen) {
+      /**
+       * Create this set of buffers on demand, which is only needed at the first
+       * time running into this, using bytes array.
+       */
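
To illustrate the check above together with the {{maxInvalidUnits = numParityUnits}} proposal, here is a rough sketch. The class and method names are hypothetical, and it assumes numParityUnits is at least 1:

```java
final class SharedBuffersSketch {
    private final int numParityUnits;   // assumed >= 1
    private byte[][] bytesArrayBuffers; // created lazily

    SharedBuffersSketch(int numParityUnits) {
        this.numParityUnits = numParityUnits;
    }

    // Returns the shared buffer at idx, holding at least dataLen bytes.
    byte[] getBuffer(int idx, int dataLen) {
        // Allocate on first use, or again when the existing buffers are too small.
        if (bytesArrayBuffers == null || bytesArrayBuffers[0].length < dataLen) {
            bytesArrayBuffers = new byte[numParityUnits][dataLen];
        }
        return bytesArrayBuffers[idx];
    }
}
```

Because the set always holds numParityUnits buffers, a different erasure pattern alone does not trigger a re-allocation; only a larger dataLen does.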
bq. We should check erasedOrNotToReadIndexes contains erasedIndexes. 
Good point. The check would avoid bad usage with mismatched inputs and erasedIndexes.
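
Such a check could look roughly like the following. This is only a sketch; the method name and the exception type are my assumptions:

```java
final class ErasedIndexesCheckSketch {
    // Hypothetical validation: every erased index must also appear in
    // erasedOrNotToReadIndexes, otherwise inputs and erasedIndexes mismatch.
    static void ensureContained(int[] erasedOrNotToReadIndexes, int[] erasedIndexes) {
        for (int erased : erasedIndexes) {
            boolean found = false;
            for (int idx : erasedOrNotToReadIndexes) {
                if (idx == erased) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                throw new IllegalArgumentException(
                    "Erased index " + erased + " has no matching null input");
            }
        }
    }
}
```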
bq. We just need one loop...
Hmm, I'm not sure. We should place the output buffers from the caller in the correct positions.
For example:
Assuming 6+3, recovering d0, not-to-read=\[p1, d3\], outputs = \[d0\]. Then adjustedByteArrayOutputsParameter
should be: 
\[p1,d0,s1(d3)\], where s* means a shared buffer. 
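
The placement in this example can be sketched with two passes. All names are illustrative rather than the patch's code, and the numeric indexes in the usage note are placeholders chosen only to reproduce the ordering above:

```java
final class OutputPlacementSketch {
    // Hypothetical helper: builds the adjusted outputs array, one slot per
    // entry of erasedOrNotToReadIndexes.
    static byte[][] adjustOutputs(int[] erasedOrNotToReadIndexes, int[] erasedIndexes,
                                  byte[][] outputs, byte[][] sharedBuffers) {
        byte[][] adjusted = new byte[erasedOrNotToReadIndexes.length][];
        // Pass 1: put each caller-provided output at the position of its erased index.
        for (int i = 0; i < erasedIndexes.length; i++) {
            boolean found = false;
            for (int j = 0; j < erasedOrNotToReadIndexes.length; j++) {
                if (erasedIndexes[i] == erasedOrNotToReadIndexes[j]) {
                    adjusted[j] = outputs[i];
                    found = true;
                    break;
                }
            }
            if (!found) {
                throw new IllegalArgumentException("Inputs do not match erasedIndexes");
            }
        }
        // Pass 2: fill the remaining (not-to-read) positions with shared buffers.
        int bufferIdx = 0;
        for (int j = 0; j < adjusted.length; j++) {
            if (adjusted[j] == null) {
                adjusted[j] = sharedBuffers[bufferIdx++];
            }
        }
        return adjusted;
    }
}
```

With placeholder indexes p1=1, d0=3, d3=6, calling it with erasedOrNotToReadIndexes = {1, 3, 6} and erasedIndexes = {3} puts the caller's d0 buffer in the middle slot and shared buffers in the other two, which is why a single loop over the outputs alone would not be enough.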

Would you please check again? Thanks.

> Enhance raw coder allowing to read least required inputs in decoding
> --------------------------------------------------------------------
>                 Key: HADOOP-11847
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11847
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: io
>            Reporter: Kai Zheng
>            Assignee: Kai Zheng
>              Labels: BB2015-05-TBR
>         Attachments: HADOOP-11847-HDFS-7285-v3.patch, HADOOP-11847-HDFS-7285-v4.patch,
HADOOP-11847-HDFS-7285-v5.patch, HADOOP-11847-HDFS-7285-v6.patch, HADOOP-11847-v1.patch, HADOOP-11847-v2.patch
> This is to enhance the raw erasure coder so that it reads only the least required inputs while
decoding. It will also refine and document the relevant APIs for better understanding and
usage. Using the least required inputs may add computing overhead, but will possibly
perform better overall since less network traffic and disk IO are involved.
> This was already planned, but I was reminded of it by [~zhz]'s question raised in
HDFS-7678, copied here:
> bq.Kai Zheng I have a question about decoding: in a (6+3) schema, if block #2 is missing,
and I want to repair it with blocks 0, 1, 3, 4, 5, 8, how should I construct the inputs to
> With this work, hopefully the answer to the above question will be obvious.
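
As a rough illustration of the scenario in the question, the sketch below assumes that units which are erased or deliberately not read are passed as null entries in an inputs array indexed by block number. That convention and all names here are assumptions for illustration, not confirmed API:

```java
import java.util.ArrayList;
import java.util.List;

final class DecodeInputsSketch {
    // Hypothetical: allocate buffers only for the units that will be read;
    // every other position stays null.
    static byte[][] buildInputs(int numUnits, int[] unitsToRead, int dataLen) {
        byte[][] inputs = new byte[numUnits][];
        for (int unit : unitsToRead) {
            inputs[unit] = new byte[dataLen];
        }
        return inputs;
    }

    // Hypothetical: the erased-or-not-to-read indexes are the null positions.
    static int[] nullPositions(byte[][] inputs) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < inputs.length; i++) {
            if (inputs[i] == null) {
                positions.add(i);
            }
        }
        int[] result = new int[positions.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = positions.get(i);
        }
        return result;
    }
}
```

For the (6+3) case in the question, reading blocks 0, 1, 3, 4, 5 and 8 leaves positions 2, 6 and 7 null under this assumption, with 2 being the block to repair.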

This message was sent by Atlassian JIRA
