hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10340) data node sudden killed
Date Tue, 23 Aug 2016 13:57:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432830#comment-15432830 ]

Kihwal Lee commented on HDFS-10340:

I checked the OOM killer code and it sends SIGKILL, as you pointed out. It might have used SIGTERM
in ancient versions. The syscall snooping wouldn't have caught an OOM kill, as the OOM killer does
not go through any system call. It sure looks like something else is sending SIGTERM to the datanode process.
I looked over the openjdk8 source but couldn't find anything that raises SIGTERM against itself to
shut down. Whoever the sender is, you should be able to catch it with the systemtap instrumentation.
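
To illustrate why the sender can be identified at all: the kernel attaches the sender's PID and UID to each signal sent with kill(2). The Python sketch below is not the systemtap instrumentation and cannot be attached to a running datanode; it only demonstrates the principle inside a single process on a Unix system. The function names are made up for this illustration, and the process signals itself as a stand-in for the unknown sender.

```python
import os
import signal


def wait_for_sigterm_sender(timeout_s=5.0):
    """Wait for a pending SIGTERM; return (sender_pid, sender_uid) or None on timeout."""
    info = signal.sigtimedwait({signal.SIGTERM}, timeout_s)
    if info is None:
        return None
    # si_pid / si_uid are filled in by the kernel for signals sent via kill(2).
    return info.si_pid, info.si_uid


def demo_self_signal():
    """Hypothetical demo: block SIGTERM, signal ourselves, read back the sender."""
    # Block SIGTERM so it stays pending instead of terminating this process.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    try:
        os.kill(os.getpid(), signal.SIGTERM)  # stand-in for the unknown sender
        return wait_for_sigterm_sender()
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

Calling `demo_self_signal()` returns this process's own PID and UID, since it is its own sender here. A kernel-level probe on signal delivery reads the same kind of information for an arbitrary target process.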

We have had similar issues due to stale pid files, but that can't be it if no service was
(re)started at that time. 
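
A stale pid file is dangerous because a control script that signals whatever PID the file holds may hit an unrelated process once the kernel has recycled that PID. The following is a minimal Python sketch of a staleness check, assuming a Linux host with /proc; the function name, the pid file path, and the marker string are hypothetical and not taken from the Hadoop scripts.

```python
import os


def check_pid_file(pid_file, expected_marker):
    """Classify a pid file as 'missing', 'dead', 'mismatch', or 'ok'."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return "missing"
    if pid <= 0:
        return "missing"
    try:
        os.kill(pid, 0)  # signal 0 only tests whether the process exists
    except ProcessLookupError:
        return "dead"
    except PermissionError:
        pass  # the process exists but belongs to another user
    try:
        # Linux-specific: arguments are NUL-separated in /proc/<pid>/cmdline.
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
    except FileNotFoundError:
        return "dead"
    # A live PID whose command line lacks the marker was likely recycled.
    return "ok" if expected_marker in cmdline else "mismatch"
```

For example, `check_pid_file("/path/to/datanode.pid", "DataNode")` (both arguments are placeholders) returning `"mismatch"` would mean the recorded PID now belongs to some other program, so signalling it would hit the wrong process.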

bq. if user of DataNode is same with NodeManager, maybe it is related with YARN-4459
Are you saying that your cluster is configured this way? If so, I agree YARN-4459 is a good
candidate. If not, we are back to square one. In any case, the systemtap instrumentation
should help identify the source of the signal.

> data node sudden killed 
> ------------------------
>                 Key: HDFS-10340
>                 URL: https://issues.apache.org/jira/browse/HDFS-10340
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>         Environment: Ubuntu 16.04 LTS , RAM 16g , cpu core : 8 , hdd 100gb, hadoop 2.6.0
>            Reporter: tu nguyen khac
>            Priority: Critical
> I tried to set up a new data node on Ubuntu 16
> and have it join an existing Hadoop HDFS cluster (there are 10 nodes in this cluster
> and they all run CentOS 6).
> But when I try to bootstrap this node, after about 10 or 20 minutes I get these strange
> errors:
> 2016-04-26 20:12:09,394 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /, dest: /, bytes: 79902, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_1379996362_1, offset: 0, srvID: 225f5b43-1dd3-4ac6-88d2-1e8d27dba55b, blockid: BP-352432948-, duration: 15331628
> 2016-04-26 20:12:09,394 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-352432948-, type=LAST_IN_PIPELINE, downstreams=0:[]
> 2016-04-26 20:12:25,410 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-352432948-
> 2016-04-26 20:12:25,411 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-352432948-
> 2016-04-26 20:13:18,546 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: Scheduling blk_1074038502_789829 file /data/hadoop_data/backup/data/current/BP-352432948- for deletion
> 2016-04-26 20:13:18,562 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: Deleted BP-352432948- blk_1074038502_789829 file /data/hadoop_data/backup/data/current/BP-352432948-
> 2016-04-26 20:15:46,481 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED
> 2016-04-26 20:15:46,504 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down DataNode at bigdata-dw-24-197/
> ************************************************************/
