ratis-issues mailing list archives

From "runzhiwang (Jira)" <j...@apache.org>
Subject [jira] [Updated] (RATIS-878) Infinite restart of LogAppender
Date Mon, 29 Jun 2020 06:49:00 GMT

     [ https://issues.apache.org/jira/browse/RATIS-878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang updated RATIS-878:
-----------------------------
    Attachment: screenshot-1.png

> Infinite restart of  LogAppender
> --------------------------------
>
>                 Key: RATIS-878
>                 URL: https://issues.apache.org/jira/browse/RATIS-878
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Blocker
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png
>
>
> *What's the problem ?*
>  After running hadoop-ozone for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, and each SenderList contained only 1-2 instances of GrpcLogAppender. There were also a lot of logs related to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]:
>  {code:java}
> INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5->229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a
> {code}

>  So there are a lot of GrpcLogAppender instances that did not stop their daemon thread when they were removed from the senders list.
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
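> A minimal sketch of the leak pattern (hypothetical names, not the actual Ratis classes): the daemon thread's Runnable captures its appender, so an appender whose thread is never stopped stays strongly reachable and cannot be garbage collected, even after it is dropped from the senders list. That matches the heap dump above, where almost none of the GrpcLogAppender instances are still referenced by a SenderList.
> {code:java}
> // Not the actual Ratis code: a minimal illustration of why an un-stopped
> // daemon thread keeps its appender alive on the heap.
> class LeakyAppender {
>   private final Thread daemon;
>   private volatile boolean running = true;
>
>   LeakyAppender() {
>     // The Runnable captures 'this'; while the thread runs, this instance is reachable.
>     daemon = new Thread(() -> {
>       while (running) {                    // never becomes false unless stop() is called
>         try {
>           Thread.sleep(1000);
>         } catch (InterruptedException e) {
>           Thread.currentThread().interrupt();
>           return;
>         }
>       }
>     });
>     daemon.setDaemon(true);
>     daemon.start();
>   }
>
>   void stop() {                            // must be called when the appender is removed from senders
>     running = false;
>     daemon.interrupt();
>   }
> }
> {code}
>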
> *Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] called so many times ?*
> 1. As the image shows, when the group is removed, SegmentedRaftLog is closed. GrpcLogAppender then throws an exception when it finds that the SegmentedRaftLog has been closed, and is [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94]; the new GrpcLogAppender throws the same exception again because the SegmentedRaftLog is still closed, and is restarted again, and so on. This results in an infinite restart of GrpcLogAppender.
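> A simplified sketch of this loop (hypothetical names, not the actual Ratis code); every failed appender schedules a replacement against the same closed log:
> {code:java}
> // Not the actual Ratis code: each appender fails immediately on the closed log
> // and asks the leader to restart it, so the restarts never end.
> class RestartLoopSketch {
>   interface RaftLogView {
>     void checkOpen();                      // throws if the log has been closed
>   }
>
>   static void runAppender(RaftLogView log, Runnable restartSender) {
>     try {
>       log.checkOpen();                     // fails at once when SegmentedRaftLog is closed
>       // ... the normal appendEntries loop would run here ...
>     } catch (IllegalStateException e) {
>       restartSender.run();                 // analogue of LeaderState::restartSender
>     }
>   }
>
>   public static void main(String[] args) {
>     RaftLogView closedLog = () -> {
>       throw new IllegalStateException("SegmentedRaftLog is already closed");
>     };
>     // Bounded here only to keep the demo finite; in the real issue every restart
>     // spawns another appender against the same closed log, forever.
>     for (int i = 0; i < 3; i++) {
>       runAppender(closedLog, () -> System.out.println("Restarting GrpcLogAppender ..."));
>     }
>   }
> }
> {code}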
> 2. Actually, when the group is removed, GrpcLogAppender is stopped: RaftServerImpl::shutdown -> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] -> LeaderState::stop -> LogAppender::stopAppender. Then SegmentedRaftLog is closed: RaftServerImpl::shutdown -> [ServerState::close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] ... . Although RoleInfo::shutdownLeaderState is called before ServerState::close, GrpcLogAppender is stopped asynchronously. So the infinite restart of GrpcLogAppender happens whenever GrpcLogAppender stops after SegmentedRaftLog has already been closed.
>  !screenshot-1.png! 
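> A small runnable sketch of that ordering (hypothetical names, not the actual Ratis code): the stop request is only a flag the appender thread has not yet observed, so the log is already closed by the time the appender fails and asks for a restart:
> {code:java}
> // Not the actual Ratis code: stopAppender only requests the stop and returns,
> // so the log is closed while the appender thread is still in flight.
> class ShutdownRaceSketch {
>   public static void main(String[] args) throws InterruptedException {
>     java.util.concurrent.atomic.AtomicBoolean stopRequested = new java.util.concurrent.atomic.AtomicBoolean(false);
>     java.util.concurrent.atomic.AtomicBoolean logOpen = new java.util.concurrent.atomic.AtomicBoolean(true);
>
>     Thread appender = new Thread(() -> {
>       // Still busy with its current iteration, so it has not looked at stopRequested yet.
>       try {
>         Thread.sleep(50);
>       } catch (InterruptedException e) {
>         return;
>       }
>       if (!logOpen.get()) {
>         // Exception path in the real code: the restart is triggered here even though
>         // the leader state is already shutting down.
>         System.out.println("Restarting GrpcLogAppender ... (stopRequested=" + stopRequested.get() + ")");
>       }
>     });
>     appender.start();
>
>     stopRequested.set(true);   // analogue of LeaderState::stop -> LogAppender::stopAppender (asynchronous)
>     logOpen.set(false);        // analogue of ServerState::close closing the SegmentedRaftLog
>
>     appender.join();           // prints the restart message with stopRequested=true
>   }
> }
> {code}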
> For details, please refer to [RATIS-840|https://issues.apache.org/jira/browse/RATIS-840].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
