hama-dev mailing list archives

From "MaoYuan Xian (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HAMA-756) Timing issue and file merging algorithm in PartitioningRunner make job fail
Date Fri, 10 May 2013 09:07:16 GMT

    [ https://issues.apache.org/jira/browse/HAMA-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13653645#comment-13653645 ]

MaoYuan Xian edited comment on HAMA-756 at 5/10/13 9:05 AM:
------------------------------------------------------------

I understand that the call "FileStatus[] status = fs.listStatus(partitionDir);" is placed where it is
to avoid the race condition.
However, the call to "peer.getNumPeers()" should also be placed between the two calls to peer.sync().
We hit the problem when some fast tasks had already completed while some slow tasks had only reached
a point before the call to peer.getNumPeers(). When these slow tasks call peer.getNumPeers(), the
getAllPeerNames method of ZooKeeperSyncClientImpl is eventually triggered, where the call
"byte[] data = zk.getData(constructKey(taskId.getJobID(), "peers", s),  this, null);" fails and
raises the exception "All peer names could not be retrieved!".
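
A minimal sketch of the placement suggested above, assuming a simplified stand-in for the peer: the Peer interface below is hypothetical and only mirrors the two calls discussed here (the real peer.sync() declares checked exceptions, which are omitted). The count is looked up once between the two barriers, and later code uses the local copy instead of calling peer.getNumPeers() again.

```java
public final class PeerCountSnapshot {

  /** Hypothetical stand-in for the peer; not the real Hama interface. */
  public interface Peer {
    int getNumPeers();
    void sync();
  }

  /** Reads the peer count once, between two barrier calls. */
  public static int snapshotNumPeers(Peer peer) {
    peer.sync();                        // first barrier: every task has arrived
    int numPeers = peer.getNumPeers();  // single lookup between the two barriers
    peer.sync();                        // second barrier: every task holds its copy
    return numPeers;
  }

  public static void main(String[] args) {
    // Fake peer for illustration only: fixed count, no-op barrier.
    Peer fake = new Peer() {
      @Override
      public int getNumPeers() {
        return 6;
      }

      @Override
      public void sync() {
        // no-op in this sketch
      }
    };

    int numPeers = snapshotNumPeers(fake);
    // From here on, loops would use the local numPeers only.
    System.out.println("numPeers snapshot = " + numPeers);
  }
}
```

Whether this fully closes the window depends on how the real sync client behaves, so treat it as an illustration of the call order rather than a verified fix.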

As for the 2nd issue, 

 if (assignedID == peer.getNumPeers())
        assignedID = assignedID - 1;

solves part of the problem but not all of it. For example:

  // Assume desiredNum=8, peer.getNumPeers()=6
  for (FileStatus statu : status) {
      int partitionID = Integer
          .parseInt(statu.getPath().getName().split("[-]")[1]);  // Consider partitionID=7
      int denom = desiredNum / peer.getNumPeers();  // denom = 8/6 = 1
      int assignedID = partitionID;                 // assignedID = 7
      if (denom > 1) {                              // denom is 1, so this block is skipped
        assignedID = partitionID / denom;
      }

      if (assignedID == peer.getNumPeers())         // false here, because 7 != 6
        assignedID = assignedID - 1;

      // TODO set replica factor to 1.
      // TODO and check whether we can write to specific DataNode.
      if (assignedID == peer.getPeerIndex()) {      // assignedID is 7, but peer.getPeerIndex()
                                                    // can only be 0~5, so no peer does the
                                                    // merge work for part-7
        ...
      }
  }
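
The arithmetic in the example above can be checked in isolation. The sketch below copies the assignment logic into a plain static method (no Hama classes) and lists the partition IDs that no peer index can match. The class name, the helper names, and clampedAssignedId are hypothetical; the clamp is only one possible remedy and is not taken from any HAMA-756 patch.

```java
import java.util.ArrayList;
import java.util.List;

public final class PartitionAssignment {

  /** Same arithmetic as the quoted loop body, with numPeers passed in. */
  public static int originalAssignedId(int partitionID, int desiredNum, int numPeers) {
    int denom = desiredNum / numPeers;
    int assignedID = partitionID;
    if (denom > 1) {
      assignedID = partitionID / denom;
    }
    if (assignedID == numPeers) {
      assignedID = assignedID - 1;
    }
    return assignedID;
  }

  /** Hypothetical remedy: clamp every out-of-range ID onto the last peer. */
  public static int clampedAssignedId(int partitionID, int desiredNum, int numPeers) {
    int denom = Math.max(1, desiredNum / numPeers);
    return Math.min(partitionID / denom, numPeers - 1);
  }

  /** Partition IDs whose assigned ID matches no peer index in 0..numPeers-1. */
  public static List<Integer> orphanedPartitions(int desiredNum, int numPeers) {
    List<Integer> orphans = new ArrayList<>();
    for (int partitionID = 0; partitionID < desiredNum; partitionID++) {
      if (originalAssignedId(partitionID, desiredNum, numPeers) >= numPeers) {
        orphans.add(partitionID);
      }
    }
    return orphans;
  }

  public static void main(String[] args) {
    System.out.println(orphanedPartitions(8, 6));   // prints [7]
    System.out.println(clampedAssignedId(7, 8, 6)); // prints 5
  }
}
```

With desiredNum=8 and 6 peers, part-6 is caught by the existing check and goes to peer 5, while part-7 is left without a handler. Clamping sends both to peer 5, at the cost of giving the last peer more merge work than the others.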
                
      was (Author: kennethxian):
    I understand that the call "FileStatus[] status = fs.listStatus(partitionDir);" is placed where it is
to avoid the race condition.
However, the call to "peer.getNumPeers()" should also be placed between the two calls to peer.sync().
We hit the problem when some fast tasks had already completed while some slow tasks had only reached
a point before the call to peer.sync(). When these slow tasks call peer.sync(), the getAllPeerNames
method of ZooKeeperSyncClientImpl is eventually triggered, where the call
"byte[] data = zk.getData(constructKey(taskId.getJobID(), "peers", s),  this, null);" fails and
raises the exception "All peer names could not be retrieved!".

As for the 2nd issue, 

 if (assignedID == peer.getNumPeers())
        assignedID = assignedID - 1;

solves part of the problem but not all of it. For example:

  // Assume desiredNum=8, peer.getNumPeers()=6
  for (FileStatus statu : status) {
      int partitionID = Integer
          .parseInt(statu.getPath().getName().split("[-]")[1]);  // Consider partitionID=7
      int denom = desiredNum / peer.getNumPeers();  // denom = 8/6 = 1
      int assignedID = partitionID;                 // assignedID = 7
      if (denom > 1) {                              // denom is 1, so this block is skipped
        assignedID = partitionID / denom;
      }

      if (assignedID == peer.getNumPeers())         // false here, because 7 != 6
        assignedID = assignedID - 1;

      // TODO set replica factor to 1.
      // TODO and check whether we can write to specific DataNode.
      if (assignedID == peer.getPeerIndex()) {      // assignedID is 7, but peer.getPeerIndex()
                                                    // can only be 0~5, so no peer does the
                                                    // merge work for part-7
        ...
      }
  }

                  
> Timing issue and file merging algorithm in PartitioningRunner make job fail
> ---------------------------------------------------------------------------
>
>                 Key: HAMA-756
>                 URL: https://issues.apache.org/jira/browse/HAMA-756
>             Project: Hama
>          Issue Type: Bug
>            Reporter: MaoYuan Xian
>            Assignee: Edward J. Yoon
>
> There are two major problems in the bsp method of PartitioningRunner that may make the
> partitioning fail:
> 1. The call to peer.getNumPeers() may trigger a timing issue. In the special situation
> where some tasks have completed the bsp call while others have only just entered the
> "for (FileStatus statu : status)" loop, the calls to peer.getNumPeers() made by the
> remaining tasks trigger the problem.
> 2. The algorithm for merging the sequence files has a problem: e.g. when desiredNum
> is 8 and the number of partitioning tasks (peer.getNumPeers()) is 6, no handler is found
> to merge the part-7 directory into a file.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
