hadoop-common-issues mailing list archives

From "Kai Zheng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13200) Seeking a better approach allowing to customize and configure erasure coders
Date Wed, 29 Mar 2017 01:34:42 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946376#comment-15946376

Kai Zheng commented on HADOOP-13200:

Thanks Andrew for the nice thoughts.

bq. I saw in a comment that the raw coders are stateless ...
Sorry, I should clarify this a bit; it's only partly true: 1) When I said stateless, I meant
that given a coder instance, each encode/decode call is stateless. That is to say, for a
certain group of data to encode/decode, we can have concurrent threads call encode/decode
to speed things up. 2) The coder instance itself does have state, specific to the schema and
configuration (and, for the decoder, the erasure parameters).
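To make the distinction concrete, here is a minimal sketch using a hypothetical, simplified XOR-parity coder (invented for illustration; this is not the actual Hadoop raw coder API): the instance holds immutable schema state, while encode() reads only its arguments, so one shared instance can safely serve concurrent threads.

```java
import java.util.concurrent.*;

// Hypothetical simplified raw encoder: instance state is the schema
// (numDataUnits), fixed at construction; encode() reads only its
// arguments and immutable fields, so concurrent calls are safe.
class XorRawEncoder {
    private final int numDataUnits;  // schema state, immutable after construction

    XorRawEncoder(int numDataUnits) {
        this.numDataUnits = numDataUnits;
    }

    // Stateless with respect to the call: output depends only on inputs.
    byte[] encode(byte[][] dataUnits) {
        byte[] parity = new byte[dataUnits[0].length];
        for (int i = 0; i < numDataUnits; i++) {
            for (int j = 0; j < parity.length; j++) {
                parity[j] ^= dataUnits[i][j];
            }
        }
        return parity;
    }
}

public class StatelessCallDemo {
    public static void main(String[] args) throws Exception {
        XorRawEncoder encoder = new XorRawEncoder(2);  // one shared instance
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // Different threads encode different cell groups via the same coder.
        Future<byte[]> f1 = pool.submit(() ->
            encoder.encode(new byte[][]{{1, 2}, {3, 4}}));
        Future<byte[]> f2 = pool.submit(() ->
            encoder.encode(new byte[][]{{5, 6}, {7, 8}}));
        System.out.println(f1.get()[0]);  // 1 ^ 3 = 2
        System.out.println(f2.get()[0]);  // 5 ^ 7 = 2
        pool.shutdown();
    }
}
```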

The biggest concern with merging the encoder and decoder is that their states would be mixed,
incurring some runtime overhead. For example, both the encoder and decoder maintain a big
coding matrix in memory; when merged, the memory consumption can double (yes, this can be
avoided with separate code paths, but that gets complicated). On the HDFS side, a coder
instance serves just one role, either encoding or decoding, never both at the same time.
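A toy illustration of the memory concern (all class names invented for illustration, not the real coder classes): a merged coder that eagerly allocates both matrices pays double the footprint, even though an HDFS instance only ever exercises one role.

```java
// Hypothetical sketch of the concern: each role needs its own coding
// matrix, so a merged coder that allocates both eagerly holds twice
// the matrix memory of a single-role instance.
class SketchRSEncoder {
    final byte[] encodeMatrix;  // ~ (dataUnits + parityUnits) * dataUnits bytes

    SketchRSEncoder(int dataUnits, int parityUnits) {
        encodeMatrix = new byte[(dataUnits + parityUnits) * dataUnits];
    }
}

class SketchMergedRSCoder {
    final byte[] encodeMatrix;
    final byte[] decodeMatrix;  // second matrix, rarely needed at the same time

    SketchMergedRSCoder(int dataUnits, int parityUnits) {
        encodeMatrix = new byte[(dataUnits + parityUnits) * dataUnits];
        decodeMatrix = new byte[(dataUnits + parityUnits) * dataUnits];
    }
}

public class MemoryDemo {
    public static void main(String[] args) {
        // RS(6,3): encode-only instance vs. merged instance.
        int single = new SketchRSEncoder(6, 3).encodeMatrix.length;
        SketchMergedRSCoder merged = new SketchMergedRSCoder(6, 3);
        int both = merged.encodeMatrix.length + merged.decodeMatrix.length;
        System.out.println(both / single);  // 2: twice the matrix memory
    }
}
```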

bq. Maybe this is what Colin wanted, since the factory classes look trivial by themselves.

I guess you're right; that's most likely what Colin thought. The factory classes are trivial
right now, but I think they can evolve to make more sense: 1) In HADOOP-13665 I think we can
do the fallback elegantly here: encoder/decoder creation can fail, and the factory can try
the next implementation configured in a list. 2) Quite some time ago, when I did micro
benchmarking of the coders, I found that caching coder instances helps performance, and it's
better to do that here than elsewhere like in {{CodecUtil}}.
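Roughly, the fallback-list and instance-caching ideas could look like this sketch (all names invented for illustration; the actual HADOOP-13665 design may differ):

```java
import java.util.*;
import java.util.function.Supplier;

// Sketch of a factory that (1) tries a configured list of coder
// implementations in order, falling back when creation fails, and
// (2) caches the first instance that works. Class and method names
// are invented for illustration, not the actual Hadoop API.
class FallbackCoderFactory {
    private final List<Supplier<Object>> configured;  // e.g. native first, pure Java second
    private Object cached;                            // cached coder instance

    FallbackCoderFactory(List<Supplier<Object>> configured) {
        this.configured = configured;
    }

    synchronized Object createEncoder() {
        if (cached != null) {
            return cached;                  // reuse: instance creation can be costly
        }
        for (Supplier<Object> candidate : configured) {
            try {
                cached = candidate.get();   // may throw, e.g. native library missing
                return cached;
            } catch (RuntimeException e) {
                // fall through to the next configured implementation
            }
        }
        throw new IllegalStateException("No usable coder implementation configured");
    }
}

public class FactoryDemo {
    public static void main(String[] args) {
        FallbackCoderFactory factory = new FallbackCoderFactory(Arrays.asList(
            () -> { throw new RuntimeException("native library not loaded"); },
            () -> "pure-java-encoder"       // stands in for a real coder instance
        ));
        System.out.println(factory.createEncoder());  // falls back to the Java coder
        // Second call returns the cached instance, no re-creation.
        System.out.println(factory.createEncoder() == factory.createEncoder());
    }
}
```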

bq. I experimented by putting the NativeRS raw encoder and raw decoder into their Factory
class, and it looks okay since they're pretty small.
It's a very interesting experiment. Yes, the native RS encoder/decoder are small, but other
coders may not be. I expect coders and coder factories to grow bigger in the future; for the
coders, if we want to support incremental encoding/decoding, more code will be added. And as
the HH codec indicates, if any party implements a complex codec, the encode/decode logic can
be much more complex.

bq. We also should rename RSRawEncoderLegacy to RSLegacyRawEncoder ...
Right, agree.

bq. it seems like we should be creating via the appropriate factory whenever possible.
Couldn't agree more.

bq. Overall though I think the current system is okay. The factory is the single entry point
to configuring a RawCoder.
I'm glad the current system works for you. Do you think it would be good to fix the above
problems in this issue?


> Seeking a better approach allowing to customize and configure erasure coders
> ----------------------------------------------------------------------------
>                 Key: HADOOP-13200
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13200
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Kai Zheng
>            Assignee: Kai Zheng
>            Priority: Blocker
>              Labels: hdfs-ec-3.0-must-do
> This is a follow-on task to HADOOP-13010, as discussed over there. There may be a better
approach to customizing and configuring erasure coders than the current raw coder factory,
as [~cmccabe] suggested. Will copy the relevant comments here to continue the discussion.

This message was sent by Atlassian JIRA
