hadoop-hdfs-issues mailing list archives

From "Chen Liang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-11142) TestLargeBlockReport#testBlockReportSucceedsWithLargerLengthLimit fails in trunk
Date Mon, 12 Mar 2018 20:33:18 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395828#comment-16395828 ]

Chen Liang commented on HDFS-11142:
-----------------------------------

Hi [~linyiqun],

Thanks for reporting this! One question: is there any analysis of how the GC pause caused
the NPE? I was not able to reproduce the error, and there is no NPE in the pasted log, so
it is not clear how the NPE happened, and it is hard for me to tell whether the patch will
make it go away. Did the patch fix the error in your environment?

In fact, it is interesting to me that a GC pause could cause an NPE here. I don't think this
is supposed to happen. I think we should look at this more carefully, as it might be a bug
elsewhere rather than in the unit test. Also, I'm not sure that catching all exceptions here
is the best approach: we want the tests to report errors when they should, not swallow all
of them. There may be exceptions we do want to throw.
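
To illustrate the concern about swallowing everything, here is a minimal, self-contained sketch of a narrower catch. It is not taken from the patch: {{NarrowCatchSketch}}, {{IoAction}} and {{succeeded}} are hypothetical names, and treating {{IOException}} as the only "expected, retry later" failure is an assumption made for illustration.

```java
import java.io.IOException;

// Hypothetical sketch, not part of HDFS-11142.001.patch.
final class NarrowCatchSketch {

  /** Hypothetical stand-in for an RPC-style call that may fail with IOException. */
  @FunctionalInterface
  interface IoAction {
    void run() throws IOException;
  }

  /**
   * Returns true if the action completes and false if it fails with an
   * IOException (assumed here to be the only failure worth retrying).
   * Unchecked exceptions such as NullPointerException are deliberately
   * not caught, so they still surface as test errors.
   */
  static boolean succeeded(IoAction action) {
    try {
      action.run();
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  private NarrowCatchSketch() {
  }
}
```

With a helper shaped like this, a polling check would retry on {{IOException}}, while an NPE would still fail the test immediately instead of being hidden until the timeout.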

Apart from that, one minor comment: you may use a lambda function for {{.waitFor()}}:
{code:java}
GenericTestUtils.waitFor(() -> { // <-- use lambda
  boolean result = true;
  try {
    nnProxy.blockReport(bpRegistration, bpId, reports,
        new BlockReportContext(1, 0, reportId, fullBrLeaseId, sorted));
  } catch (Exception e) {
    result = false;
  }
  return result;
}, 3000, 120000);
{code}
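
For anyone unfamiliar with the two forms, the sketch below contrasts an anonymous {{Supplier<Boolean>}} with the equivalent lambda. It is a simplified stand-in, not Hadoop's {{GenericTestUtils}}: {{WaitForSketch}} and {{pollUntil}} are hypothetical names, and this version returns a boolean rather than throwing on timeout so that it stays self-contained.

```java
import java.util.function.Supplier;

// Hypothetical sketch; pollUntil is NOT GenericTestUtils.waitFor.
final class WaitForSketch {

  /**
   * Polls check every checkEveryMillis until it returns true or
   * waitForMillis has elapsed. Returns whether the check succeeded.
   */
  static boolean pollUntil(Supplier<Boolean> check,
      long checkEveryMillis, long waitForMillis) {
    long deadline = System.nanoTime() + waitForMillis * 1_000_000L;
    while (true) {
      if (check.get()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(checkEveryMillis);
      } catch (InterruptedException e) {
        // Restore the interrupt flag and give up.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  /** Anonymous-class form: succeeds on the third attempt. */
  static Supplier<Boolean> anonymousForm(int[] attempts) {
    return new Supplier<Boolean>() {
      @Override
      public Boolean get() {
        return ++attempts[0] >= 3;
      }
    };
  }

  /** Lambda form with the same behavior. */
  static Supplier<Boolean> lambdaForm(int[] attempts) {
    return () -> ++attempts[0] >= 3;
  }

  private WaitForSketch() {
  }
}
```

Both suppliers behave identically; the lambda only removes the boilerplate around the single {{get()}} method.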
 

> TestLargeBlockReport#testBlockReportSucceedsWithLargerLengthLimit fails in trunk
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-11142
>                 URL: https://issues.apache.org/jira/browse/HDFS-11142
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Yiqun Lin
>            Assignee: Yiqun Lin
>            Priority: Major
>         Attachments: HDFS-11142.001.patch, test-fails-log.txt
>
>
> The test {{TestLargeBlockReport#testBlockReportSucceedsWithLargerLengthLimit}} fails
in trunk. I looked into this, and it seems that a long GC pause caused the datanode to be shut down
unexpectedly while it was sending the large block report, after which the test threw an NPE. The related
output log:
> {code}
> 2016-11-15 11:31:18,889 [DataNode: [[[DISK]file:/testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/2/dfs/data/data1,
[DISK]file:/testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/2/dfs/data/data2]]
 heartbeating to localhost/127.0.0.1:51450] INFO  datanode.DataNode (BPServiceActor.java:blockReport(415))
- Successfully sent block report 0x2ae5dd91bec02273,  containing 2 storage report(s), of which
we sent 2. The reports had 0 total blocks and used 1 RPC(s). This took 0 msec to generate
and 49 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
> 2016-11-15 11:31:18,890 [DataNode: [[[DISK]file:/testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/2/dfs/data/data1,
[DISK]file:/testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/2/dfs/data/data2]]
 heartbeating to localhost/127.0.0.1:51450] INFO  datanode.DataNode (BPOfferService.java:processCommandFromActive(696))
- Got finalize command for block pool BP-814229154-172.17.0.3-1479209475497
> 2016-11-15 11:31:24,026 [org.apache.hadoop.util.JvmPauseMonitor$Monitor@97e93f1] INFO
 util.JvmPauseMonitor (JvmPauseMonitor.java:run(205)) - Detected pause in JVM or host machine
(eg GC): pause of approximately 4936ms
> GC pool 'PS MarkSweep' had collection(s): count=1 time=4194ms
> GC pool 'PS Scavenge' had collection(s): count=1 time=765ms
> 2016-11-15 11:31:24,026 [org.apache.hadoop.util.JvmPauseMonitor$Monitor@5a4bef8] INFO
 util.JvmPauseMonitor (JvmPauseMonitor.java:run(205)) - Detected pause in JVM or host machine
(eg GC): pause of approximately 4898ms
> GC pool 'PS MarkSweep' had collection(s): count=1 time=4194ms
> GC pool 'PS Scavenge' had collection(s): count=1 time=765ms
> 2016-11-15 11:31:24,114 [main] INFO  hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1943))
- Shutting down the Mini HDFS Cluster
> 2016-11-15 11:31:24,114 [main] INFO  hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdownDataNodes(1983))
- Shutting down DataNode 0
> {code}
> The stack trace:
> {code}
> java.lang.NullPointerException: null
> 	at org.apache.hadoop.hdfs.server.datanode.TestLargeBlockReport.testBlockReportSucceedsWithLargerLengthLimit(TestLargeBlockReport.java:97)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

