hadoop-hdfs-issues mailing list archives

From "jiangyu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9143) updateCountForQuota method during EditlogTailer loadEdit can make SNN timeout very often
Date Fri, 25 Sep 2015 09:01:04 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907829#comment-14907829 ]

jiangyu commented on HDFS-9143:
-------------------------------

Here is the log from SNN:
2015-09-25 12:08:21,289 WARN org.apache.hadoop.ipc.Server: IPC Server handler 118 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.blockReceivedAndDeleted from 10.39.7.50:35587 Call#1454030 Retry#0: output error
2015-09-25 12:08:21,289 WARN org.apache.hadoop.ipc.Server: IPC Server handler 74 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.blockReceivedAndDeleted from 10.39.5.22:57698 Call#2825473 Retry#0: output error
2015-09-25 12:08:21,288 WARN org.apache.hadoop.ipc.Server: IPC Server handler 91 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.blockReceivedAndDeleted from 10.39.5.27:48523 Call#1297974 Retry#0: output error
2015-09-25 12:08:21,288 WARN org.apache.hadoop.ipc.Server: IPC Server handler 50 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.blockReceivedAndDeleted from 10.39.5.28:47496 Call#1325076 Retry#0: output error

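I also logged the time taken by updateCountForQuota. Judging by the timestamps, the "Update count time" value is in milliseconds, so this run took about 36.7 seconds: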
2015-09-25 03:14:13,951 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file http://10.39.5.41:8480/getJournal?jid=ns1&segmentTxId=205812193&storageInfo=-56%3A358820969%3A0%3ACID-1561e550-a7b9-4886-8a9a-cc2328b82912&ugi=hadoop, http://10.39.5.42:8480/getJournal?jid=ns1&segmentTxId=205812193&storageInfo=-56%3A358820969%3A0%3ACID-1561e550-a7b9-4886-8a9a-cc2328b82912&ugi=hadoop of size 221412 edits # 2403 loaded in 0 seconds
2015-09-25 03:14:50,657 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Update count time :36706
2015-09-25 03:14:50,657 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Loaded 2403 edits starting from txid 205812192
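
As a cross-check on the number above, the gap between the "loaded in 0 seconds" line and the "Update count time" line can be computed from the two timestamps. The small helper below is only an illustration: the class name LogGap and the method millisBetween are made up for this note and are not part of Hadoop.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not Hadoop code: measures the gap between two log timestamps.
final class LogGap {
    // Matches the timestamp layout used in the NameNode log lines, e.g. 2015-09-25 03:14:13,951
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    // Returns the elapsed time in milliseconds between two log timestamps.
    static long millisBetween(String start, String end) {
        LocalDateTime from = LocalDateTime.parse(start, FMT);
        LocalDateTime to = LocalDateTime.parse(end, FMT);
        return Duration.between(from, to).toMillis();
    }

    public static void main(String[] args) {
        System.out.println(
                millisBetween("2015-09-25 03:14:13,951", "2015-09-25 03:14:50,657"));
    }
}
```

The result matches the logged value, which suggests that almost all of the wall-clock time between the two lines was spent in the quota update.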

> updateCountForQuota method during EditlogTailer loadEdit can make SNN timeout very often

> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-9143
>                 URL: https://issues.apache.org/jira/browse/HDFS-9143
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.4.0, 2.6.0
>            Reporter: jiangyu
>            Priority: Minor
>
> I have seen many logs from datanodes in our cluster reporting socket timeouts when sending heartbeat or blockReceivedAndDeleted calls to the Standby NameNode (SNN), but this never happens with the Active NameNode.

> At first I thought it might be caused by the EditLogTailer fetching too many edits and triggering a full GC, but the GC log showed that this was not the case. I then investigated the code path and the logs and found that it takes only a few seconds for the SNN to fetch the journal and merge it. However, if you open the SNN web page while the merge is in progress, it does not respond, much like a stop-the-world pause during a full GC, even though no GC is running at that time. I then took jstack dumps of the SNN over a period of time and found that all of the time was being spent in the updateCountForQuota method in FSImage.
> updateCountForQuota is called every time edits are loaded (loadEdits). It updates the count of every directory with a quota in the namespace, starting from ROOT, and it holds the write lock of FSImage while doing so. As a result, every time the SNN merges edits from the JN, it causes a stop-the-world pause.
> I don't think it is necessary for the SNN to call updateCountForQuota every time it tails the edit log. When it transitions to Active, it calls updateCountForQuota anyway, so no quota data is ever missed.
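
To make the proposal in the description concrete, here is a toy model of the behaviour. It is a sketch under stated assumptions, not the HDFS implementation: QuotaTailSketch, Dir, tailEdits, recount and the recountEveryTime flag are all hypothetical names, and the real FSImage and EditLogTailer code is structured differently. It only shows the idea that a full recount from the root under the write lock can be skipped while tailing and done once on transition to Active.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy model only; every name in this class is hypothetical and not taken from HDFS.
final class QuotaTailSketch {
    // A directory that caches the item count of its subtree.
    static final class Dir {
        final List<Dir> children = new ArrayList<>();
        int files;          // files directly inside this directory
        long cachedCount;   // cached subtree count, refreshed by recount()
        Dir(int files) { this.files = files; }
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Dir root;
    long recounts = 0;      // how many full-tree walks have been performed

    QuotaTailSketch(Dir root) { this.root = root; }

    // Walks the whole tree from the given node; cost grows with namespace size.
    private static long recount(Dir d) {
        long total = 1 + d.files;   // the directory itself plus its files
        for (Dir c : d.children) {
            total += recount(c);
        }
        d.cachedCount = total;
        return total;
    }

    // Applies a batch of edits on the standby. With recountEveryTime=true the
    // write lock is also held for a full-tree walk after every batch.
    void tailEdits(Runnable edits, boolean recountEveryTime) {
        lock.writeLock().lock();
        try {
            edits.run();
            if (recountEveryTime) {
                recount(root);
                recounts++;
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Recounts once when becoming active, so cached counts are correct from then on.
    void transitionToActive() {
        lock.writeLock().lock();
        try {
            recount(root);
            recounts++;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

In this model, calling tailEdits with recountEveryTime=false keeps the write-lock hold time proportional to the size of the edit batch rather than the size of the namespace, at the cost of cached counts being stale on the standby until transitionToActive runs.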



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
