ratis-issues mailing list archives

From "runzhiwang (Jira)" <j...@apache.org>
Subject [jira] [Commented] (RATIS-878) Infinite restart of LogAppender
Date Mon, 29 Jun 2020 10:55:00 GMT

    [ https://issues.apache.org/jira/browse/RATIS-878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147678#comment-17147678

runzhiwang commented on RATIS-878:

bq. One bug I can see in the above function is that if the state is EXCEPTION, it would return
from "if (!isRunning())" check itself without transitioning to CLOSING state. I think it should
just check for CLOSING or CLOSED in the if condition.
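The quoted early-return bug can be illustrated with a minimal, self-contained sketch. The state names follow the quote, but the class and method names are hypothetical stand-ins, not the actual Ratis LifeCycle code:

```java
class LifeCycleDemo {
  enum State { NEW, RUNNING, EXCEPTION, CLOSING, CLOSED }

  static State state = State.EXCEPTION;

  static boolean isRunning() { return state == State.RUNNING; }

  /** Buggy shape described in the quote: when state is EXCEPTION,
   *  the early return fires and we never transition to CLOSING. */
  static void stopBuggy() {
    if (!isRunning()) {
      return;              // EXCEPTION falls through here and is never closed
    }
    state = State.CLOSING;
  }

  /** Suggested shape: only bail out if already CLOSING or CLOSED. */
  static void stopFixed() {
    if (state == State.CLOSING || state == State.CLOSED) {
      return;
    }
    state = State.CLOSING; // EXCEPTION now transitions to CLOSING as well
  }
}
```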

I created RATIS-989 to fix this. RATIS-878 will instead limit how many times the LogAppender
can be restarted within a given time interval.
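A minimal sketch of that kind of restart rate limiting, using a sliding time window; all names and the design here are hypothetical illustrations, not the actual RATIS-878 patch:

```java
import java.util.ArrayDeque;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch (not Ratis code): allow at most maxRestarts
 *  restarts within a sliding time window, instead of restarting forever. */
class RestartLimiter {
  private final int maxRestarts;
  private final long windowNanos;
  private final ArrayDeque<Long> restartTimes = new ArrayDeque<>();

  RestartLimiter(int maxRestarts, long window, TimeUnit unit) {
    this.maxRestarts = maxRestarts;
    this.windowNanos = unit.toNanos(window);
  }

  /** Returns true if a restart at time nowNanos is still allowed. */
  synchronized boolean tryAcquire(long nowNanos) {
    // Drop restart timestamps that have fallen out of the window.
    while (!restartTimes.isEmpty()
        && nowNanos - restartTimes.peekFirst() > windowNanos) {
      restartTimes.pollFirst();
    }
    if (restartTimes.size() >= maxRestarts) {
      return false;  // too many recent restarts: give up instead of looping
    }
    restartTimes.addLast(nowNanos);
    return true;
  }
}
```

Passing the clock in as a parameter keeps the sketch deterministic and testable; a real implementation would read `System.nanoTime()` at the restart site.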

> Infinite restart of LogAppender
> --------------------------------
>                 Key: RATIS-878
>                 URL: https://issues.apache.org/jira/browse/RATIS-878
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Blocker
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png
> *What's the problem?*
>  After running hadoop-ozone for 4 days, the datanode leaked memory. A heap dump showed
460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each containing 1-2
instances of GrpcLogAppender. There were also a lot of logs related
to [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState:
Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5->229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a{code}

>  So there are a lot of GrpcLogAppender instances that did not stop their Daemon Thread when
removed from the SenderList.
>  !screenshot-2.png! 
>  !screenshot-3.png! 
> *Why is [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
called so many times?*
> 1. As the image shows, when the group is removed, SegmentedRaftLog is closed; GrpcLogAppender
then throws an exception when it finds that SegmentedRaftLog has been closed, and is [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94].
The new GrpcLogAppender throws the same exception when it finds SegmentedRaftLog closed, is
restarted again, and so on. This results in an infinite restart of GrpcLogAppender.
> 2. Actually, when the group is removed, GrpcLogAppender is stopped: RaftServerImpl::shutdown
-> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
-> LeaderState::stop -> LogAppender::stopAppender; then SegmentedRaftLog is closed:
 RaftServerImpl::shutdown -> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
... . Although RoleInfo::shutdownLeaderState is called before ServerState:close, GrpcLogAppender
is stopped asynchronously. So the infinite restart of GrpcLogAppender happens whenever
GrpcLogAppender stops after SegmentedRaftLog closes.
>  !screenshot-4.png! 
> For more details, please refer to [RATIS-840|https://issues.apache.org/jira/browse/RATIS-840].
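The restart loop described in point 1 of the quoted issue can be illustrated with a minimal sketch; every class and method name here is a simplified stand-in, not the actual Ratis code:

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch (not Ratis code) of why restarting on every
 *  exception loops forever once the underlying log has been closed. */
class ClosedLogDemo {
  static class RaftLogStub {
    volatile boolean open = true;
    void append() {
      if (!open) throw new IllegalStateException("log is closed");
    }
  }

  static final AtomicInteger restarts = new AtomicInteger();

  /** Naive handler mirroring the bug: always restart on failure.
   *  The budget parameter exists only so the demo terminates. */
  static void runAppender(RaftLogStub log, int remainingBudget) {
    try {
      log.append();
    } catch (IllegalStateException e) {
      if (remainingBudget == 0) return;
      restarts.incrementAndGet();
      runAppender(log, remainingBudget - 1); // restart -> fails again -> restart ...
    }
  }

  /** One possible fix: do not restart once the log has been closed. */
  static void runAppenderFixed(RaftLogStub log) {
    try {
      log.append();
    } catch (IllegalStateException e) {
      if (log.open) {        // only restart if the log is still usable
        restarts.incrementAndGet();
        runAppenderFixed(log);
      }
    }
  }
}
```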

This message was sent by Atlassian Jira
