hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5394) fix race conditions in DN caching and uncaching
Date Thu, 07 Nov 2013 22:41:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816735#comment-13816735 ]

Colin Patrick McCabe commented on HDFS-5394:
--------------------------------------------

bq. CACHING_CANCELLED discussion

Yeah, it does make more sense to explicitly check for the states we expect to be in, rather
than having a catch-all.  I have changed this to use {{Preconditions}} to assert that we are
in either the {{CACHING}} or the {{CACHING_CANCELLED}} state there.  That seemed more appropriate,
and it makes the expected states explicit.
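A minimal sketch of the explicit state check, for illustration only. The enum, class, and method names are hypothetical rather than taken from the HDFS code, and `checkState` below is a local stand-in with the same contract as Guava's `Preconditions.checkState(boolean, Object)` so the example needs no extra jars.

```java
// Hypothetical sketch: these names are illustrative, not the HDFS code.
enum CacheState { CACHING, CACHING_CANCELLED, CACHED, UNCACHING }

final class CompletionCheck {
    // Same contract as Guava's Preconditions.checkState(boolean, Object):
    // throw IllegalStateException when the condition is false.
    static void checkState(boolean expression, Object message) {
        if (!expression) {
            throw new IllegalStateException(String.valueOf(message));
        }
    }

    // Decide the next state once the mmap/mlock work for a replica has finished.
    // Only CACHING and CACHING_CANCELLED are legal here; anything else is a bug,
    // so fail loudly instead of falling through a catch-all branch.
    static CacheState onCachingFinished(long blockId, CacheState current) {
        checkState(current == CacheState.CACHING || current == CacheState.CACHING_CANCELLED,
            "block " + blockId + ": expected CACHING or CACHING_CANCELLED, got " + current);
        // A cancelled replica is dropped (null = no entry); the caller would undo the mlock.
        return current == CacheState.CACHING ? CacheState.CACHED : null;
    }
}
```

An unexpected state such as `CACHED` now surfaces as an `IllegalStateException` instead of being silently absorbed.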

bq. Makes sense, though I'll note that 6,000,000 is 100 minutes, not ten minutes.  Overkill.

Noted.  Reduced this to 10 minutes, which should be ample.
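For reference, here is the arithmetic behind the reviewer's note. This assumes the value is a JUnit 4 `@Test(timeout = ...)` in milliseconds; the comment itself does not state the units.

```java
import java.util.concurrent.TimeUnit;

final class TimeoutMath {
    // Assumption: the test timeout is in milliseconds, as JUnit 4's @Test(timeout = ...) is.
    static final long OLD_TIMEOUT_MS = 6_000_000L;
    // The reduced ten-minute timeout, expressed in milliseconds.
    static final long NEW_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(10);

    static long toMinutes(long millis) {
        return TimeUnit.MILLISECONDS.toMinutes(millis);
    }

    public static void main(String[] args) {
        System.out.println(toMinutes(OLD_TIMEOUT_MS)); // 100
        System.out.println(NEW_TIMEOUT_MS);            // 600000
    }
}
```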

bq. Do we need that Preconditions check in setUp? There's already an assumeTrue for the same
thing right above it, so I don't think it'll do anything.

No, it just repeats the {{assumeTrue}} check above it.  Removed.

bq. I'd like to see the LogVerificationAppender used in testUncachingBlocksBeforeCachingFinishes
too. This seems like it might be flaky though. What was wrong with the old approach that used
a barrier to force ordering?

The problem is that we don't have a barrier in every place we would need one.  We'd need to know
that the DN had received the DNA_CACHE heartbeat response and initiated caching during the
3-second window it has to do so.  Only then could we be sure of later seeing a log message about
cancellation.  Checking for that log message without such a guarantee would be, as you guessed,
flaky, and we don't need another flaky test.
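For comparison, here is a minimal sketch of the kind of two-latch barrier a deterministic test would need, with the DataNode simulated by a single worker thread. `OrderedCacheTest` and its latches are hypothetical. The real DN exposes no such hooks, which is the gap described above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: the DN is simulated by one worker thread, and these latches
// stand in for test hooks that the real DataNode does not have.
final class OrderedCacheTest {
    private final CountDownLatch cachingStarted = new CountDownLatch(1); // DN -> test
    private final CountDownLatch uncacheIssued = new CountDownLatch(1);  // test -> DN
    private final AtomicBoolean cancelled = new AtomicBoolean(false);
    private volatile boolean sawCancellation = false;

    // Simulated caching task on a DN worker thread.
    private void cachingTask() throws InterruptedException {
        cachingStarted.countDown(); // barrier 1: caching has begun
        uncacheIssued.await();      // barrier 2: wait until the uncache request exists
        if (cancelled.get()) {
            sawCancellation = true; // the real code would log the cancellation here
        }
    }

    // Returns true if the caching task observed the cancellation.
    boolean run() {
        Thread worker = new Thread(() -> {
            try {
                cachingTask();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true); // never keep the JVM alive if the test bails out
        worker.start();
        try {
            // Without this wait the uncache could arrive before caching starts,
            // and no cancellation would be observed at all.
            if (!cachingStarted.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("caching never started");
            }
            cancelled.set(true); // issue the "uncache" request
            uncacheIssued.countDown();
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
        return sawCancellation;
    }
}
```

Because each side waits on the other, the outcome does not depend on timing. The cost is that the code under test has to expose hooks like these.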

I'd like to keep a LogVerificationAppender for this test in mind as a future improvement,
but still get this fix committed soon, since HDFS-5366, HDFS-5320, HDFS-5451, and HDFS-5431
all depend on this patch to some extent.  Perhaps we can roll a test improvement for this
into HDFS-5451, since that JIRA is all about debuggability and logging.
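The capture side of such a log-verification test can be sketched as follows. Hadoop's LogVerificationAppender is a log4j-based test helper. This sketch swaps in `java.util.logging` so it is self-contained, and `CapturingHandler` and the logger name used with it are hypothetical. Counting messages is only reliable once the test can guarantee that caching actually started before the uncache request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;

// Hypothetical stand-in for a log-verification appender, using java.util.logging
// instead of log4j so the example needs no third-party jars.
final class CapturingHandler extends Handler {
    private final List<String> messages = new ArrayList<>();

    @Override
    public synchronized void publish(LogRecord record) {
        // Store the raw (unformatted) message of every record that reaches this handler.
        messages.add(record.getMessage());
    }

    @Override
    public void flush() {}

    @Override
    public void close() {}

    // Count captured messages that contain the given fragment.
    synchronized long countContaining(String fragment) {
        return messages.stream().filter(m -> m != null && m.contains(fragment)).count();
    }
}
```

A test would attach the handler with `Logger.getLogger(name).addHandler(handler)`, drive the cache/uncache sequence, and then assert on `countContaining("cancel")`.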

> fix race conditions in DN caching and uncaching
> -----------------------------------------------
>
>                 Key: HDFS-5394
>                 URL: https://issues.apache.org/jira/browse/HDFS-5394
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-5394-caching.001.patch, HDFS-5394-caching.002.patch, HDFS-5394-caching.003.patch,
>                      HDFS-5394-caching.004.patch, HDFS-5394.005.patch, HDFS-5394.006.patch, HDFS-5394.007.patch,
>                      HDFS-5394.008.patch
>
>
> The DN needs to handle situations where it is asked to cache the same replica more than
> once.  (Currently, it can actually do two mmaps and mlocks.)  It also needs to handle the
> situation where caching a replica is cancelled before said caching completes.
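Both requirements in the issue description can be illustrated with a per-block state map. This is a minimal sketch under stated assumptions, not the FsDatasetCache implementation. The names are hypothetical, and the actual mmap/mlock and munlock/munmap calls are reduced to comments.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: names are illustrative, not the HDFS FsDatasetCache API.
enum ReplicaState { CACHING, CACHING_CANCELLED, CACHED }

final class DedupingReplicaCache {
    private final ConcurrentMap<Long, ReplicaState> states = new ConcurrentHashMap<>();

    // Returns true only for the first cache request for a block. Repeats are ignored,
    // so the replica is mmapped and mlocked at most once.
    boolean requestCache(long blockId) {
        return states.putIfAbsent(blockId, ReplicaState.CACHING) == null;
    }

    // Uncache request. In-flight caching is only marked cancelled; the caching task
    // cleans up when it finishes. A fully cached replica is dropped right away.
    void requestUncache(long blockId) {
        states.computeIfPresent(blockId, (id, s) -> {
            if (s == ReplicaState.CACHED) {
                return null; // remove entry; real code would munlock/munmap here
            }
            return ReplicaState.CACHING_CANCELLED; // CACHING or already cancelled
        });
    }

    // Called when the mmap/mlock work completes; returns true if the replica stays cached.
    boolean finishCaching(long blockId) {
        ReplicaState next = states.computeIfPresent(blockId, (id, s) -> {
            if (s == ReplicaState.CACHING_CANCELLED) {
                return null; // cancelled mid-flight: drop entry; real code would undo the mlock
            }
            if (s != ReplicaState.CACHING) {
                throw new IllegalStateException("block " + id + " in unexpected state " + s);
            }
            return ReplicaState.CACHED;
        });
        return next == ReplicaState.CACHED;
    }

    ReplicaState stateOf(long blockId) {
        return states.get(blockId);
    }
}
```

Using `putIfAbsent` and `computeIfPresent` keeps each check-then-update atomic per block. A second cache request for a replica that is already caching, or a cancel that races with completion, therefore cannot slip in between the check and the update.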



--
This message was sent by Atlassian JIRA
(v6.1#6144)
