hadoop-common-issues mailing list archives

From "Kai Zheng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12041) Implement another Reed-Solomon coder in pure Java
Date Tue, 02 Feb 2016 00:25:39 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127327#comment-15127327 ]

Kai Zheng commented on HADOOP-12041:
------------------------------------

Thanks Zhe for the careful review. The comments are very helpful.
bq. makeValidIndexes should be static and maybe moved to CodecUtil
How about adding a {{CoderUtil}} for the coder implementation side? This is something I wanted to do
but it got missed. As we're going to have various raw erasure coders, we'll need a general utility
class to hold such helpers. {{CodecUtil}} can then focus on the caller interface side.
bq. The reported checkstyle issue looks valid.
Sure, it will be fixed along with the changes for your comments. :)
bq. When you say the new coder is compatible with ISA-L coder, you mean I can use new Java
coder to decode data encoded with ISA-L, right?
Exactly. Probably that's why we're here. As you may have already noted, there are thin HDFS
clients that aren't equipped with the native Hadoop library, so a pure Java solution with
good performance is needed in case the ISA-L coder isn't available.
bq. gftbls is hard to understand by name, does it mean gfTables?
Ah, yes. Will rename it.
bq. Any reason to allocate encodeMatrix, decodeMatrix, and invertMatrix as 1D arrays but use
them as matrices? Can we use 2D arrays?
I borrowed the approach from the ISA-L library, along with the rest of the implementation. I believe
it's done that way for better performance: the mentioned matrices are accessed very frequently during
encoding/decoding computation, so it's better to keep them compact so they can stay in the cache.
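To illustrate the layout (just a rough sketch, not the patch code; the names here are made up for illustration): the matrix is stored as one flat {{byte[]}} and addressed row-major, so the whole thing stays contiguous in memory.

{code:java}
// Minimal sketch of the flat-array layout: a rows x cols matrix kept in one
// contiguous byte[] and addressed as row * cols + col, instead of a byte[][].
public final class FlatMatrixSketch {
  public static void main(String[] args) {
    int rows = 6, cols = 9;                      // e.g. parity units x total units
    byte[] encodeMatrix = new byte[rows * cols]; // 1D storage, matrix semantics

    for (int r = 0; r < rows; r++) {
      for (int c = 0; c < cols; c++) {
        encodeMatrix[r * cols + c] = (byte) (r ^ c); // fill element (r, c)
      }
    }
    System.out.println(encodeMatrix[2 * cols + 3]);  // read element (2, 3)
  }
}
{code}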
bq. The below is not easy to understand. Why don't we need to prepare decodeMatrix if the
cached indexes haven't changed? 
I thought it better to add some comments for easier understanding. The decodeMatrix only depends
on the schema and the erased indexes. The schema is bound during the initialization phase and won't
change; the erased indexes, however, can change between calls, but if they haven't changed,
the decodeMatrix doesn't need to be recomputed. As indicated on the HDFS side, we're very likely
to call a decoder repeatedly for a corrupt block group with fixed erasure indexes, so
the optimization trick here should make some sense.
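Roughly the idea is like below (a hedged sketch only; {{cachedErasedIndexes}} and {{prepareDecodeMatrix}} are illustrative names, not the actual fields/methods in the patch):

{code:java}
import java.util.Arrays;

// Sketch: rebuild the decode matrix only when the erased indexes differ from
// the previous call; otherwise reuse the cached one.
class DecodeMatrixCacheSketch {
  private int[] cachedErasedIndexes;
  private byte[] decodeMatrix;

  void decode(byte[][] inputs, int[] erasedIndexes, byte[][] outputs) {
    if (cachedErasedIndexes == null
        || !Arrays.equals(cachedErasedIndexes, erasedIndexes)) {
      // Erasure pattern changed: recompute the schema-dependent decode matrix.
      decodeMatrix = prepareDecodeMatrix(erasedIndexes);
      cachedErasedIndexes = erasedIndexes.clone();
    }
    // ... apply decodeMatrix to the inputs to reconstruct the erased units ...
  }

  private byte[] prepareDecodeMatrix(int[] erasedIndexes) {
    return new byte[0]; // placeholder for the real matrix inversion work
  }
}
{code}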
bq. RSUtil2#genReedSolomonMatrix is unused
OK, let me remove it for now and add it back later when we support that mode of encode matrix generation.
bq. Patch is already very large, I think we should add package-info separately.
Yeah, sure. I think we can add them when doing the coder rename, if we can bear the checkstyle
issue complaining about the lack of them.
bq. So about the incompatibility between HDFS-RAID coder and the new Java coder: is it because
they use different GF matrices?
I think so. The implementation approach is also quite different. As you see, in this new Java
coder (same as in the ISA-L library), the encode/decode matrix is calculated in advance, and
encoding and decoding are unified in the same way: both purely perform the matrix operation
against the input data bytes. The benefit is that the decoder and encoder can be optimized
together and achieve the same high performance.
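In a nutshell, the unified part looks roughly like below (an illustrative sketch, not the coder in the patch; {{gfMul}} is a stand-in GF(2^8) multiply using the common 0x11d polynomial): both encode and decode boil down to multiplying a precomputed matrix by the input units, byte position by byte position.

{code:java}
// Sketch: unified encode/decode as a GF(256) matrix operation over the inputs.
class MatrixCoderSketch {
  static void matrixMultiply(byte[] matrix, int rows, int cols,
                             byte[][] inputs, byte[][] outputs, int dataLen) {
    for (int r = 0; r < rows; r++) {
      for (int pos = 0; pos < dataLen; pos++) {
        int acc = 0;
        for (int c = 0; c < cols; c++) {
          acc ^= gfMul(matrix[r * cols + c], inputs[c][pos]); // GF(256) add is XOR
        }
        outputs[r][pos] = (byte) acc;
      }
    }
  }

  // GF(2^8) multiply via shift-and-add with the 0x11d primitive polynomial;
  // a real coder would use log/exp tables for speed.
  static int gfMul(int a, int b) {
    a &= 0xff;
    b &= 0xff;
    int result = 0;
    while (b != 0) {
      if ((b & 1) != 0) {
        result ^= a;
      }
      a <<= 1;
      if ((a & 0x100) != 0) {
        a ^= 0x11d;
      }
      b >>= 1;
    }
    return result;
  }
}
{code}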

> Implement another Reed-Solomon coder in pure Java
> -------------------------------------------------
>
>                 Key: HADOOP-12041
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12041
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Kai Zheng
>            Assignee: Kai Zheng
>         Attachments: HADOOP-12041-v1.patch, HADOOP-12041-v2.patch, HADOOP-12041-v3.patch,
HADOOP-12041-v4.patch, HADOOP-12041-v5.patch, HADOOP-12041-v6.patch
>
>
> Currently the existing Java RS coders based on the {{GaloisField}} implementation have some
> drawbacks or limitations:
> * The decoder unnecessarily computes units that are not really erased (HADOOP-11871);
> * The decoder requires the inputs to be in parity units + data units order in the decode
> API (HADOOP-12040);
> * We need to support or align with the native erasure coders regarding the concrete coding
> algorithms and matrix, so Java coders and native coders can be easily swapped in/out,
> transparently to HDFS (HADOOP-12010);
> * It's unnecessarily flexible, which incurs some overhead; as HDFS erasure coding is entirely
> a byte-based data system, we don't need to consider any symbol size other than 256.
> This calls for implementing another RS coder in pure Java, in addition to the existing
> {{GaloisField}}-based one from HDFS-RAID. The new Java RS coder will be favored and used by
> default to resolve the related issues. The old HDFS-RAID originated coder will still be there
> for comparison, and for converting old data from HDFS-RAID systems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
