hadoop-hdfs-issues mailing list archives

From "Chen Liang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12415) Ozone: TestXceiverClientManager and TestAllocateContainer occasionally fails
Date Thu, 12 Oct 2017 17:52:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202341#comment-16202341 ]

Chen Liang commented on HDFS-12415:
-----------------------------------

I looked into this a little bit too. What seems to be happening is that {{SCMCommonPolicy#chooseDatanodes}}
calls {{nodeManager.getNodes(OzoneProtos.NodeState.HEALTHY);}}, but the returned list contains
a {{null}} datanode id entry, so the {{hasEnoughSpace(d, sizeRequired)}} call on the null
d fails with an NPE. The list with the null entry comes from {{SCMNodeManager#getNodes}},
where there seems to be some datanode id in {{healthyNodes}} that is not present in the {{nodes}} map.
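
To make the failure mode concrete, here is a minimal sketch of how resolving healthy ids through a map can yield a {{null}} entry. It is not the actual {{SCMNodeManager}} code: the class name, the method names and the use of plain strings for datanode ids are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical stand-in for the lookup described above; not the real Ozone code.
public class NullEntrySketch {

    // Resolves every healthy id through the nodes map. An id that is missing
    // from the map resolves to null, and that null ends up in the result list.
    static List<String> resolve(Collection<String> healthyIds, Map<String, String> nodes) {
        return healthyIds.stream()
                .map(nodes::get)
                .collect(Collectors.toList());
    }

    // Defensive variant: drops ids that cannot be resolved. This only hides the
    // inconsistency between the two collections; it does not explain it.
    static List<String> resolveSkippingMissing(Collection<String> healthyIds,
                                               Map<String, String> nodes) {
        return healthyIds.stream()
                .map(nodes::get)
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> nodes = new HashMap<>();
        nodes.put("dn-1", "datanode-1");

        // "dn-2" is listed as healthy but was never added to nodes.
        List<String> healthyIds = Arrays.asList("dn-1", "dn-2");

        System.out.println(resolve(healthyIds, nodes));                // [datanode-1, null]
        System.out.println(resolveSkippingMissing(healthyIds, nodes)); // [datanode-1]
    }
}
```

Any caller that dereferences the entries of the first list, as {{hasEnoughSpace}} does with d, would hit the NPE on the second element.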

I don't see how a datanode id could be present in {{healthyNodes}} but not in {{nodes}}, because
the first thing register does is always add that datanode to {{nodes}}, before {{healthyNodes}}.
I can only think the issue is just as [~msingh] mentioned: it is probably due
to some unexpected race condition when two register calls happen and change the
HashMap {{nodes}} at the same time. So I would +1 on Mukul's change. Additionally, I ran {{TestXceiverClientManager}}
several tens of times with the v005 patch applied. The test did not fail.
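
For illustration only, the sketch below shows one common way to make concurrent register calls safe: backing both collections with {{ConcurrentHashMap}}. This is an assumption about the general technique, not a description of what the attached patches actually do. The class, its fields and its methods are hypothetical stand-ins that reuse the names from this comment.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; the real SCMNodeManager has different types and logic.
public class ConcurrentRegisterSketch {
    // A plain java.util.HashMap is not safe for concurrent writers and can lose
    // entries; ConcurrentHashMap supports concurrent put calls.
    private final Map<String, String> nodes = new ConcurrentHashMap<>();
    private final Map<String, Long> healthyNodes = new ConcurrentHashMap<>();

    // Same ordering as described above: nodes first, then healthyNodes.
    void register(String id) {
        nodes.put(id, "datanode-" + id);
        healthyNodes.put(id, System.nanoTime());
    }

    // True when every healthy id can be resolved through the nodes map.
    boolean isConsistent() {
        return nodes.keySet().containsAll(healthyNodes.keySet());
    }

    int size() {
        return nodes.size();
    }

    // Registers threads * perThread distinct ids from several threads at once.
    static ConcurrentRegisterSketch registerConcurrently(int threads, int perThread) {
        ConcurrentRegisterSketch manager = new ConcurrentRegisterSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            final int t = i;
            pool.submit(() -> {
                for (int j = 0; j < perThread; j++) {
                    manager.register("dn-" + t + "-" + j);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("registration did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return manager;
    }

    public static void main(String[] args) {
        ConcurrentRegisterSketch manager = registerConcurrently(8, 1000);
        System.out.println("registered=" + manager.size()
                + " consistent=" + manager.isConsistent());
    }
}
```

With thread-safe maps, every id written to {{healthyNodes}} is also resolvable through {{nodes}} once all register calls have returned. The same run against a plain {{HashMap}} carries no such guarantee.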

> Ozone: TestXceiverClientManager and TestAllocateContainer occasionally fails
> ----------------------------------------------------------------------------
>
>                 Key: HDFS-12415
>                 URL: https://issues.apache.org/jira/browse/HDFS-12415
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7240
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>         Attachments: HDFS-12415-HDFS-7240.001.patch, HDFS-12415-HDFS-7240.002.patch,
HDFS-12415-HDFS-7240.003.patch, HDFS-12415-HDFS-7240.004.patch, HDFS-12415-HDFS-7240.005.patch
>
>
> TestXceiverClientManager seems to be occasionally failing in some Jenkins jobs:
> {noformat}
> java.lang.NullPointerException
>  at org.apache.hadoop.ozone.scm.node.SCMNodeManager.getNodeStat(SCMNodeManager.java:828)
>  at org.apache.hadoop.ozone.scm.container.placement.algorithms.SCMCommonPolicy.hasEnoughSpace(SCMCommonPolicy.java:147)
>  at org.apache.hadoop.ozone.scm.container.placement.algorithms.SCMCommonPolicy.lambda$chooseDatanodes$0(SCMCommonPolicy.java:125)
> {noformat}
> see more from [this report|https://builds.apache.org/job/PreCommit-HDFS-Build/21065/testReport/]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


