hadoop-common-issues mailing list archives

From "Kai Zheng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13010) Refactor raw erasure coders
Date Fri, 15 Apr 2016 21:00:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243638#comment-15243638 ]

Kai Zheng commented on HADOOP-13010:

Thanks Colin.
bq. If I understand correctly, you're making the case that there is data (such as matrices)
which should be shared between multiple concurrent encode or decode operations. If that's
the case, then let's make that data static and share it between all instances. But I still
think that Encoder/Decoder should manage its own buffers rather than having them passed in
on every call.
Yes, you're right: I meant that some data should be shared between multiple concurrent encode
or decode operations. That data only makes sense for a particular coder instance (which is bound
to a schema), so it isn't suitable to be static; on the other hand, it is also specific to a
decode call, so it isn't suitable to reside in the coder instance either.
In the {{processErasures}} function in {{erasure_coder.c}}, note the following code:
static int processErasures(IsalDecoder* pCoder, unsigned char** inputs,
                                    int* erasedIndexes, int numErased) {
  int i, r, ret, index;
  int numDataUnits = pCoder->coder.numDataUnits;
  int isChanged = 0;

  // Map each data unit to the next available (non-NULL) input.
  for (i = 0, r = 0; i < numDataUnits; i++, r++) {
    while (inputs[r] == NULL) {
      r++;
    }

    if (pCoder->decodeIndex[i] != r) {
      pCoder->decodeIndex[i] = r;
      isChanged = 1;
    }
  }

  for (i = 0; i < numDataUnits; i++) {
    pCoder->realInputs[i] = inputs[pCoder->decodeIndex[i]];
  }

  // Reuse the cached state when neither the inputs nor the erasures changed.
  if (isChanged == 0 &&
          compare(pCoder->erasedIndexes, pCoder->numErased,
                           erasedIndexes, numErased) == 0) {
    return 0; // Optimization, nothing to do
  }

  // ... (rest of the function is not shown in this excerpt)
}

{{erasedIndexes}} and {{inputs}} are passed in from the {{decode}} call. They are often the
same from one call to the next, but they can and frequently do differ. That's why a call with
these two parameters generates some data that is better cached in the coder instance, while
the two parameters themselves are not suitable to be part of the coder instance's state.
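
To make the distinction concrete, here is a minimal, self-contained sketch of the pattern, not the actual Hadoop code. {{SketchDecoder}}, {{initDecoder}}, {{prepareDecode}}, {{MAX_UNITS}} and {{rebuildCount}} are hypothetical names made up for illustration, and the counter merely stands in for the expensive step of regenerating the decode matrix. The per-call parameters ({{inputs}}, {{erasedIndexes}}) are passed in on every call, while only the data derived from them is cached in the instance.

```c
#include <stddef.h>
#include <string.h>

#define MAX_UNITS 16  /* hypothetical upper bound on units, for this sketch only */

/* Hypothetical, simplified stand-in for a decoder; not the real IsalDecoder. */
typedef struct {
  int numDataUnits;
  int numAllUnits;
  int decodeIndex[MAX_UNITS];    /* cached: input slot used for each data unit */
  int erasedIndexes[MAX_UNITS];  /* cached: erasure pattern of the last call   */
  int numErased;
  int hasCache;                  /* 0 until the derived data has been built    */
  int rebuildCount;              /* stand-in for regenerating a decode matrix  */
} SketchDecoder;

static void initDecoder(SketchDecoder* d, int numDataUnits, int numAllUnits) {
  memset(d, 0, sizeof(*d));
  d->numDataUnits = numDataUnits;
  d->numAllUnits = numAllUnits;
}

/*
 * inputs and erasedIndexes are per-call parameters and are never stored as
 * pointers. Returns 0 if the cached derived data was reused, 1 if it was
 * rebuilt, and -1 if the arguments are invalid.
 */
static int prepareDecode(SketchDecoder* d, unsigned char** inputs,
                         const int* erasedIndexes, int numErased) {
  int i, r;
  int isChanged = !d->hasCache;

  if (numErased < 0 || numErased > MAX_UNITS ||
      d->numDataUnits > MAX_UNITS || d->numAllUnits > MAX_UNITS) {
    return -1;
  }

  /* Pick the first numDataUnits available (non-NULL) inputs. */
  for (i = 0, r = 0; i < d->numDataUnits; i++, r++) {
    while (r < d->numAllUnits && inputs[r] == NULL) {
      r++;
    }
    if (r >= d->numAllUnits) {
      d->hasCache = 0;  /* cache may be half-updated; force a rebuild next time */
      return -1;        /* not enough inputs to decode */
    }
    if (d->decodeIndex[i] != r) {
      d->decodeIndex[i] = r;
      isChanged = 1;
    }
  }

  if (!isChanged && numErased == d->numErased &&
      memcmp(d->erasedIndexes, erasedIndexes,
             (size_t)numErased * sizeof(int)) == 0) {
    return 0;  /* same inputs and erasures as last time: reuse cached data */
  }

  /* Remember this call's pattern and rebuild the derived data. */
  memcpy(d->erasedIndexes, erasedIndexes, (size_t)numErased * sizeof(int));
  d->numErased = numErased;
  d->hasCache = 1;
  d->rebuildCount++;
  return 1;
}
```

Calling {{prepareDecode}} twice with the same available inputs and the same erasures rebuilds the derived data only once; changing either one triggers a rebuild. Note that this sketch, like the cached fields in the snippet above, is not safe for concurrent decode calls on one instance without extra synchronization.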

> Refactor raw erasure coders
> ---------------------------
>                 Key: HADOOP-13010
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13010
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Kai Zheng
>            Assignee: Kai Zheng
>             Fix For: 3.0.0
>         Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch
> This will refactor the raw erasure coders according to comments received so far.
> * As discussed in HADOOP-11540 and suggested by [~cmccabe], it is better not to rely on
> class inheritance to reuse code; instead, the shared code can be moved to a utility.
> * As suggested by [~jingzhao] some time ago, it is better to have a state holder that
> keeps some checking results for later reuse during an encode/decode call.
> This will not get rid of some of the inheritance levels, since how to do so isn't clear
> yet and it would have a big impact. I do hope the end result of this refactoring makes
> all the levels clearer and easier to follow.

This message was sent by Atlassian JIRA
