hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-3901) QJM: send 'heartbeat' messages to JNs even when they are out-of-sync
Date Fri, 07 Sep 2012 03:45:08 GMT

     [ https://issues.apache.org/jira/browse/HDFS-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-3901:
------------------------------

    Attachment: hdfs-3901.txt

The attached patch implements the improvement as described. I also cleaned up the display on the
web UI and included a time-based lag measurement rather than only the transaction-based one.
This way, monitoring software can have reasonable, user-understandable defaults (e.g. "alert if
one of the loggers is more than 1 minute behind") without having to know the transaction rate
of the individual cluster.

In addition to running the modified unit test, I also ran this on a cluster and verified that the
web UI readouts were reasonable. Sending kill -STOP to one JN caused that node's lag readout to
increase steadily, and after I sent kill -CONT, the readout slowly dropped back down to 0 as the
node caught up.
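
As a rough illustration of the two lag readouts discussed above, here is a minimal sketch in Java. It is not taken from hdfs-3901.txt: the class LoggerLagSketch and all of its methods are hypothetical, and it assumes time-based lag can be approximated as the gap between the writer's most recent commit and the logger's most recent acknowledgement.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; not the class from hdfs-3901.txt. */
final class LoggerLagSketch {
    private long committedTxId;   // highest txid the writer has committed
    private long ackedTxId;       // highest txid this logger has acknowledged
    private long lastCommitNanos; // when the writer last committed
    private long lastAckNanos;    // when this logger last acknowledged

    void onCommit(long txId, long nowNanos) {
        committedTxId = Math.max(committedTxId, txId);
        lastCommitNanos = nowNanos;
    }

    void onAck(long txId, long nowNanos) {
        ackedTxId = Math.max(ackedTxId, txId);
        lastAckNanos = nowNanos;
    }

    /** Transaction-based lag: depends on the cluster's transaction rate. */
    long lagTxns() {
        return Math.max(committedTxId - ackedTxId, 0);
    }

    /** Time-based lag (assumed definition): last commit minus last ack. */
    long lagMillis() {
        return TimeUnit.NANOSECONDS.toMillis(
            Math.max(lastCommitNanos - lastAckNanos, 0));
    }

    /** Example monitoring rule: "alert if more than N ms behind". */
    boolean shouldAlert(long thresholdMillis) {
        return lagMillis() > thresholdMillis;
    }
}
```

With a rule such as shouldAlert(60000), the threshold stays meaningful whether the cluster commits ten or ten thousand transactions per second, which is the point of the time-based readout.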
                
> QJM: send 'heartbeat' messages to JNs even when they are out-of-sync
> --------------------------------------------------------------------
>
>                 Key: HDFS-3901
>                 URL: https://issues.apache.org/jira/browse/HDFS-3901
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: QuorumJournalManager (HDFS-3077)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-3901.txt
>
>
> Currently, if one of the JNs has fallen out of sync with the writer (e.g. because it went
down), it will be marked as such until the next log roll. This causes the writer to stop
sending any RPCs to it, which means that the JN's metrics no longer reflect up-to-date information
on how far behind it is.
> This patch will introduce a heartbeat() RPC that has no effect except to update the JN's
view of the latest committed txid. When the writer is talking to an out-of-sync logger, it
will send these heartbeat messages once a second.
> In a future patch we can extend the heartbeat functionality so that NNs periodically
check their connections to JNs if no edits arrive, such that a fenced NN won't accidentally
continue to serve reads indefinitely.
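
The writer-side behavior in the description above can be sketched roughly as follows. This is an illustration only, not the patch itself: JournalNodeStub, WriterChannelSketch, and every method and field on them are hypothetical names, and the caller is assumed to pass in the current time in milliseconds.

```java
/** Hypothetical stand-in for a JournalNode; records what it is told. */
final class JournalNodeStub {
    long lastWrittenTxId; // last edit actually written by this JN
    long committedTxId;   // JN's view of the latest committed txid
    int heartbeats;       // number of heartbeat calls received

    void journal(long lastTxId, long committedTxId) {
        this.lastWrittenTxId = lastTxId;
        this.committedTxId = committedTxId;
    }

    /** No effect except updating the committed txid (plus a test counter). */
    void heartbeat(long committedTxId) {
        this.committedTxId = Math.max(this.committedTxId, committedTxId);
        heartbeats++;
    }
}

/** Hypothetical writer-side channel to a single JN. */
final class WriterChannelSketch {
    static final long HEARTBEAT_INTERVAL_MILLIS = 1000; // once a second

    private final JournalNodeStub jn;
    private boolean outOfSync;
    private long nextHeartbeatMillis; // earliest time for the next heartbeat

    WriterChannelSketch(JournalNodeStub jn) { this.jn = jn; }

    void markOutOfSync() { outOfSync = true; }

    /** A log roll gives the JN a chance to rejoin. */
    void onLogRoll() { outOfSync = false; }

    void sendEdits(long lastTxId, long committedTxId, long nowMillis) {
        if (!outOfSync) {
            jn.journal(lastTxId, committedTxId);
        } else if (nowMillis >= nextHeartbeatMillis) {
            // Out of sync: skip the edits, but keep the JN's metrics fresh.
            jn.heartbeat(committedTxId);
            nextHeartbeatMillis = nowMillis + HEARTBEAT_INTERVAL_MILLIS;
        }
    }
}
```

In this sketch the out-of-sync JN never receives edits until the next log roll, yet its committedTxId keeps advancing, so a lag metric computed on the JN stays current.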

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
