hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13010) Refactor raw erasure coders
Date Fri, 15 Apr 2016 20:08:25 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243509#comment-15243509

Colin Patrick McCabe commented on HADOOP-13010:

bq. The problem is, a decoder associates expensive coding buffers and computed coding matrices,
which would be good to stay in CPU core near enough caches for the performance. The cached
data is per decoder, not only schema specific, but also erasure index specific in decode call,
so it's not good to keep the cache out of decoder, but still makes sense to cache it because
in HDFS side it's repeatedly called in a loop for a large block size (64k cell size ->
256mb block size). You might have a check about the native codes for native coders about the
expensive buffers and data cached in every decode call. We had benchmarked the coders and
showed this optimization obtained great speedup. Java InputStreams are similar to here, but
not exactly because it's pure view-only and leverages OS/IO level caches for file reading

If I understand correctly, you're making the case that there is data (such as matrices) which
should be shared between multiple concurrent encode or decode operations.  If that's the case,
then let's make that data static and share it between all instances.  But I still think that
Encoder/Decoder should manage its own buffers rather than having them passed in on every call.

bq. Having the common base class would allow encoder and decoder to share common properties,
not just configurations, but also schema info and some options. We can also say that encoder
and decoder are also coders, which allows to write some common behaviors to deal with coders,
not encoder or decoder specific. I understand it should also work by composition, but right
now I don't see very much benefits to switch this from one style to the other, or troubles
if we don't.

Hmm.  The only state in {{AbstractRawErasureCoder.java}} is configuration state.  I don't
see why we need this class.  Everything in there could and should be a utility function.

The benefit of getting rid of this class is that with a shallower inheritance hierarchy, it's
easier to understand what's going on.  To continue the analogy with Java, InputStream and
OutputStream don't share a common base class.

bq. It sounds better not to have the interfaces since the benefit is obvious. So in summary
how about having these classes (no interface) now: still AbstractRawErasureCoder, RawErasureEncoder/Decoder
(no Abstract prefix now, with the original interface combined), and all kinds of concrete
inherent encoders/decoders. All client codes will declare RawErasureEncoder/Decoder type when
creating instances.

It seems reasonable, but I don't see the need for AbstractRawErasureCoder.

> Refactor raw erasure coders
> ---------------------------
>                 Key: HADOOP-13010
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13010
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Kai Zheng
>            Assignee: Kai Zheng
>             Fix For: 3.0.0
>         Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch
> This will refactor raw erasure coders according to some comments received so far.
> * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to rely class
inheritance to reuse the codes, instead they can be moved to some utility.
> * Suggested by [~jingzhao] somewhere quite some time ago, better to have a state holder
to keep some checking results for later reuse during an encode/decode call.
> This would not get rid of some inheritance levels as doing so isn't clear yet for the
moment and also incurs big impact. I do wish the end result by this refactoring will make
all the levels more clear and easier to follow.

This message was sent by Atlassian JIRA

View raw message