hadoop-hdfs-issues mailing list archives

From "Aaron T. Myers (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-5064) Standby checkpoints should not block concurrent readers
Date Tue, 25 Feb 2014 07:41:27 GMT

     [ https://issues.apache.org/jira/browse/HDFS-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Aaron T. Myers updated HDFS-5064:

    Attachment: HDFS-5064.patch

Thanks a lot for the review, Andrew. Here's an updated patch which is rebased on trunk.

bq. I have just one nit: 64-bit reads are not atomic in the current Java memory model, so
we need to slap a volatile on NNStorage#mostRecentCheckpointId since the getter is no longer

I'm assuming you mean {{NNStorage#mostRecentCheckpointTxId}}? If so, that was already marked
volatile by the original patch. Or were you perhaps referring to something else?
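
For readers following the volatile point: the Java Language Specification (section 17.7) allows a write to a non-volatile {{long}} to be treated as two separate 32-bit writes, while reads and writes of a {{volatile long}} are always atomic. The sketch below illustrates the pattern under discussion, an unsynchronized getter over a volatile 64-bit field. It is not the actual {{NNStorage}} code; the class name {{CheckpointTxIdHolder}} and its methods are hypothetical and exist only for illustration.

```java
// Hypothetical holder class, NOT the real NNStorage implementation.
public class CheckpointTxIdHolder {
    // volatile guarantees that reads and writes of this 64-bit value are
    // atomic and that a write is visible to later reads in other threads.
    private volatile long mostRecentCheckpointTxId;

    // Safe to call without holding any lock because the field is volatile.
    public long getMostRecentCheckpointTxId() {
        return mostRecentCheckpointTxId;
    }

    // Writers are still serialized here; this is an assumption of the sketch.
    public synchronized void setMostRecentCheckpointTxId(long txId) {
        this.mostRecentCheckpointTxId = txId;
    }

    public static void main(String[] args) {
        CheckpointTxIdHolder holder = new CheckpointTxIdHolder();
        // A value that needs more than 32 bits, so a torn read would be observable.
        holder.setMostRecentCheckpointTxId(0x100000001L);
        System.out.println(holder.getMostRecentCheckpointTxId());
    }
}
```

Without {{volatile}}, a reader that skips the lock could in principle observe the high half of one write and the low half of another.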

Kihwal, do you have any further thoughts on this change?

> Standby checkpoints should not block concurrent readers
> -------------------------------------------------------
>                 Key: HDFS-5064
>                 URL: https://issues.apache.org/jira/browse/HDFS-5064
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha, namenode
>    Affects Versions: 2.3.0
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>         Attachments: HDFS-5064.patch, HDFS-5064.patch
> We've observed an issue which causes fetches of the {{/jmx}} page of the NN to take a
> long time to load when the standby is in the process of creating a checkpoint.
> Even though both creating the checkpoint and gathering the statistics for {{/jmx}} take
> only the FSNS read lock, the issue is that since the FSNS uses a _fair_ RW lock, a single
> writer attempting to get the lock will block all threads attempting to get only the read lock
> for the duration of the checkpoint. This will cause {{/jmx}}, and really any thread only attempting
> to get the read lock, to block for the duration of the checkpoint, even though they should
> be able to proceed concurrently with the checkpointing thread.
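
The blocking behaviour described in the issue can be reproduced in isolation. The following is a minimal sketch, assuming the FSNS lock behaves like a fair {{java.util.concurrent.locks.ReentrantReadWriteLock}}; the class {{FairLockDemo}} and the method {{lateReaderBlocked}} are hypothetical names, and the "checkpointer" thread only stands in for the real checkpointing code. One thread holds the read lock, a writer queues behind it, and a late reader then fails to get the read lock until the first reader releases it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; not part of HDFS.
public class FairLockDemo {

    // Returns true if a late reader could NOT acquire the read lock
    // while a writer was queued behind a long-running reader.
    public static boolean lateReaderBlocked() throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair mode
        CountDownLatch readHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Stand-in for the checkpoint: holds the read lock for a long time.
        Thread checkpointer = new Thread(() -> {
            lock.readLock().lock();
            try {
                readHeld.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        });
        checkpointer.start();
        readHeld.await();

        // A single writer arrives and queues behind the reader.
        Thread writer = new Thread(() -> {
            lock.writeLock().lock();
            lock.writeLock().unlock();
        });
        writer.start();
        while (!lock.hasQueuedThread(writer)) {
            Thread.sleep(1); // wait until the writer is actually enqueued
        }

        // Stand-in for a /jmx request. The timed tryLock honours fairness,
        // unlike the untimed tryLock(), which may barge ahead of the queue.
        boolean acquired = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
        if (acquired) {
            lock.readLock().unlock();
        }

        // Let the "checkpoint" finish so the writer can run and exit.
        release.countDown();
        checkpointer.join();
        writer.join();
        return !acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("late reader blocked: " + lateReaderBlocked());
    }
}
```

The late reader is compatible with the reader that already holds the lock, yet it still waits because the fair policy queues it behind the pending writer.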
