hadoop-hdfs-issues mailing list archives

From "Xudong Cao (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-14646) Standby NameNode should terminate the FsImage put process as soon as possible if the peer NN is not in the appropriate state to receive an image.
Date Mon, 15 Jul 2019 02:59:00 GMT

     [ https://issues.apache.org/jira/browse/HDFS-14646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xudong Cao updated HDFS-14646:
------------------------------
    Description: 
*Problem Description:*
 In a multi-NameNode deployment, when a Standby NameNode (SNN) uploads an FsImage, it puts the
image to every other NN, whether or not that peer is the Active NameNode (ANN). Even if the
peer NN immediately replies with an error (such as TransferResult.NOT_ACTIVE_NAMENODE_FAILURE
or TransferResult.OLD_TRANSACTION_ID_FAILURE), the local SNN does not terminate the put. It
sends the entire FsImage to the peer NN and does not read the peer's reply until the put has
completed.

In a relatively large HDFS cluster, the FsImage can often reach about 30 GB. In this case,
the futile put causes two problems:
 # It wastes time and bandwidth.
 # Because the ImageServlet of the peer NN has stopped reading the FsImage, the socket Send-Q
of the local SNN grows very large and the ImageUpload thread blocks on the socket write for a
long time. As a result, the local StandbyCheckpointer thread is often blocked for several
hours.

*An example is as follows:*
 In the following figures, the local NN 100.76.3.234 is an SNN, the peer NN 100.76.3.170 is
another SNN, and 8080 is the NN HTTP port. When the local SNN starts to put the FsImage, 170
immediately replies with a NOT_ACTIVE_NAMENODE_FAILURE error. At that point the local SNN
should terminate the put, but in fact it has to wait until the image has been completely put
to the peer NN before it can read the response.
 # Because the ImageServlet of the peer NN has stopped reading the FsImage, the socket Send-Q
of the local SNN is very large:

 !largeSendQ.png!

 # Moreover, the local SNN's ImageUpload thread is blocked on the socket write for a long
time:

 !blockedInWritingSocket.png!

 # Eventually, the StandbyCheckpointer thread of the local SNN, which waits for the result of
the ImageUpload thread, blocks in Future.get(), possibly for several hours:

 !get1.png!

 !get2.png!
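
The blocking pattern in step 3 can be illustrated with a minimal Java sketch. It is not the
StandbyCheckpointer code: the class, method, and latch names are hypothetical, and a
CountDownLatch stands in for a socket write that cannot make progress because the peer has
stopped reading. A bounded get() is used only so that the sketch terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not taken from the Hadoop source.
class BlockedUploadSketch {

    /**
     * Submits a fake upload task and waits for it the way a checkpointer
     * thread would wait on Future.get(). Returns true if the waiter was
     * still blocked after waitMillis.
     */
    static boolean waiterStillBlocked(boolean uploadStalls, long waitMillis) {
        ExecutorService uploadPool = Executors.newSingleThreadExecutor();
        // Stands in for "the peer drains the socket"; never released when stalled.
        CountDownLatch peerReadsData = new CountDownLatch(uploadStalls ? 1 : 0);
        try {
            Future<String> upload = uploadPool.submit(() -> {
                peerReadsData.await(); // simulated blocking socket write
                return "uploaded";
            });
            try {
                // An unbounded upload.get() would block here for as long as
                // the simulated write is stuck.
                upload.get(waitMillis, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException e) {
                upload.cancel(true); // interrupt the stuck task so the sketch can exit
                return true;
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            uploadPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("stalled upload, waiter blocked: " + waiterStillBlocked(true, 200));
        System.out.println("healthy upload, waiter blocked: " + waiterStillBlocked(false, 5000));
    }
}
```

The waiter has no way to learn that the peer already rejected the image, so its wait lasts
as long as the stalled write does.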

 

 

*Solution:*
 Before the local SNN puts an FsImage to a peer NN, it needs to test whether the put is
actually needed at this time. The test process is:
 # Establish an HTTP connection with the peer NN, send the put request, and then immediately
read the response (this is the key point). If the peer NN replies with any of the following
errors (TransferResult.AUTHENTICATION_FAILURE, TransferResult.NOT_ACTIVE_NAMENODE_FAILURE,
TransferResult.OLD_TRANSACTION_ID_FAILURE), terminate the put process immediately.
 # If the peer NN really is the ANN and can receive the FsImage normally, it replies to the
local SNN with HTTP response 410 (HttpServletResponse.SC_GONE, which is TransferResult.UNEXPECTED_FAILURE).
Only then does the local SNN actually begin to put the image.
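
The decision in the two steps above can be sketched in Java as follows. This is an
illustration, not the actual patch: the class, enum, and method names are hypothetical, and
the numeric status codes paired with the three TransferResult errors (403, 417, 409) are
assumptions that are not stated in this description. Only the 410 (SC_GONE) value comes from
the text above.

```java
import java.net.HttpURLConnection;

// Hypothetical illustration of the early-response check; not the actual patch.
class EarlyPutCheckSketch {

    enum Decision { ABORT_PUT, PROCEED_WITH_PUT, UNSPECIFIED }

    // Assumed HTTP status codes for the three errors named in the description.
    static final int AUTHENTICATION_FAILURE = HttpURLConnection.HTTP_FORBIDDEN;    // 403 (assumption)
    static final int NOT_ACTIVE_NAMENODE_FAILURE = 417;                            // assumption
    static final int OLD_TRANSACTION_ID_FAILURE = HttpURLConnection.HTTP_CONFLICT; // 409 (assumption)

    // Stated in the description: a peer that can accept the image replies with 410.
    static final int READY_TO_RECEIVE = HttpURLConnection.HTTP_GONE;

    /** Maps the status read right after sending the put request to a decision. */
    static Decision decide(int earlyStatus) {
        if (earlyStatus == AUTHENTICATION_FAILURE
                || earlyStatus == NOT_ACTIVE_NAMENODE_FAILURE
                || earlyStatus == OLD_TRANSACTION_ID_FAILURE) {
            return Decision.ABORT_PUT; // step 1: stop before sending any image data
        }
        if (earlyStatus == READY_TO_RECEIVE) {
            return Decision.PROCEED_WITH_PUT; // step 2: the peer can take the image
        }
        // The description does not say how other statuses are handled.
        return Decision.UNSPECIFIED;
    }

    public static void main(String[] args) {
        System.out.println(decide(NOT_ACTIVE_NAMENODE_FAILURE));
        System.out.println(decide(READY_TO_RECEIVE));
    }
}
```

Other status codes are left as UNSPECIFIED because the description does not cover them.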

*Note:*
 Reproducing this problem requires a large cluster (the FsImage in our cluster is about
30 GB), so a unit test is difficult to write. In our cluster, the modification solved the
problem and the large Send-Q backlog no longer occurs.

  was:
*Problem Description:*
 In a multi-NameNode deployment, when a Standby NameNode (SNN) uploads an FsImage, it puts the
image to every other NN, whether or not that peer is the Active NameNode (ANN). Even if the
peer NN immediately replies with an error (such as TransferResult.NOT_ACTIVE_NAMENODE_FAILURE
or TransferResult.OLD_TRANSACTION_ID_FAILURE), the local SNN does not terminate the put. It
sends the entire FsImage to the peer NN and does not read the peer's reply until the put has
completed.

In a relatively large HDFS cluster, the FsImage can often reach about 30 GB. In this case,
the futile put causes two problems:
 # It wastes time and bandwidth.
 # Because the ImageServlet of the peer NN has stopped reading the FsImage, the socket Send-Q
of the local SNN grows very large and the ImageUpload thread blocks on the socket write for a
long time. As a result, the local StandbyCheckpointer thread is often blocked for several
hours.

*An example is as follows:*
 In the following figures, the local NN 100.76.3.234 is an SNN, the peer NN 100.76.3.170 is
another SNN, and 8080 is the NN HTTP port. When the local SNN starts to put the FsImage, 170
immediately replies with a NOT_ACTIVE_NAMENODE_FAILURE error. At that point the local SNN
should terminate the put, but in fact it has to wait until the image has been completely put
to the peer NN before it can read the response.
 # Because the ImageServlet of the peer NN has stopped reading the FsImage, the socket Send-Q
of the local SNN is very large:

 !largeSendQ.png!

 # Moreover, the local SNN's ImageUpload thread is blocked on the socket write for a long
time:

 !blockedInWritingSocket.png!

 # Eventually, the StandbyCheckpointer thread of the local SNN, which waits for the result of
the ImageUpload thread, blocks in Future.get(), possibly for several hours:

 !get1.png!

          

           

 

 

*Solution:*
 When the local SNN is ready to put an FsImage to the peer NN, it needs to test whether it
really needs to put it at this time. The test process is:
 # Establish an HTTP connection with the peer NN, send a put request, and then immediately
read the response (this is the key point). If the peer NN replies with any of the following
errors (TransferResult.AUTHENTICATION_FAILURE, TransferResult.NOT_ACTIVE_NAMENODE_FAILURE,
TransferResult.
 # If the peer NN is truly the ANN and can receive the FsImage normally, it will reply to
the local SNN with an HTTP response 410 (HttpServletResponse.SC_GONE, which is TransferResult.UNEXPECTED_FAILURE).
At this time, the local SNN can really begin to put the image.

*Note:*
 Reproducing this problem requires a large cluster (the FsImage in our cluster is about
30 GB), so a unit test is difficult to write. In our real cluster, the modification solved
the problem. The large Send-Q backlog no longer occurs.


> Standby NameNode should terminate the FsImage put process as soon as possible if the
peer NN is not in the appropriate state to receive an image.
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-14646
>                 URL: https://issues.apache.org/jira/browse/HDFS-14646
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.1.2
>            Reporter: Xudong Cao
>            Assignee: Xudong Cao
>            Priority: Major
>         Attachments: blockedInWritingSocket.png, get1.png, get2.png, largeSendQ.png
>
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

