hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5394) fix race conditions in DN caching and uncaching
Date Wed, 06 Nov 2013 00:02:17 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814426#comment-13814426
] 

Andrew Wang commented on HDFS-5394:
-----------------------------------

Sorry for leaving this for so long, got tied up in a variety of different things. Thanks for
bumping it based on feedback thus far, I think we're close.

* I like the test stub for mlock

Nits:

* Unused imports in FsDatasetImpl and FsVolumeImpl
* Do we still need to rename {{getExecutor}} to {{getCacheExecutor}} in FsVolumeImpl?
* {{State#isUncaching()}} is unused
* Could use a core pool size of 0 for {{uncachingExecutor}}, I don't think it's that latency
sensitive
* usedBytes javadoc: "more things to cache that we can't actually do because of" is an awkward
turn of phrase, maybe say "assign more blocks than we can actually cache because of" instead
* MappableBlock#load javadoc: visibleLeng parameter should be renamed to length. The return
value is now also a MappableBlock, not a boolean.
* Key: rename {{id}} to {{blockId}} for clarity? or add a bit of javadoc
* Naming the HashMap {{replicaMap}} is confusing since there's already a datanode {{ReplicaMap}}
class. Maybe {{mappableBlockMap}} instead?

Impl:
* Caching can fail if the underlying block is invalidated in between getting the block's filename
and running the CacheTask. It'd be nice to distinguish this race from a real error for when
we do metrics (and also quash the exception).
* If we get a {{DNA_CACHE}} for a block that is currently being uncached, shouldn't we try
to cancel the uncache and re-cache it? The NN will resend the command, but it'd be better
to not have to wait for that.
{code}
          if ((value == null) || (value.state != State.CACHING)) {
{code}
* Could this be written with {{value.state == State.CACHING_CANCELLED}} instead? Would be
clearer, and I believe equivalent since {{uncacheBlock}} won't set the state to {{UNCACHING}}
if it's {{CACHING}} or {{CACHING_CANCELLED}}.
* Even better would be interrupting a {{CachingTask}} on uncache since it'll save us I/O and
CPU.
* Could we combine {{CACHING_CANCELLED}} into {{UNCACHING}}? It seems like {{CachingTask}}
could check for {{UNCACHING}} in that if statement at the end and uncache, same sort of change
for {{uncacheBlock}}.
* I think using a switch/case on the prevValue.state in uncacheBlock would be clearer

Test:
* 6,000,000 milliseconds seem like very long test timeouts :) Can we change them to say, 60,000?
* Are these new log prints for sanity checking? Maybe we can just remove them.
* Some of the comments seem to refer to a previous patch version that used a countdown latch.
* It's unclear what this is testing beyond caching and then uncaching a bunch of blocks. Can
we check for log prints to see that it's actually cancelling as expected? Any other ideas
for definitively hitting cancellation?


> fix race conditions in DN caching and uncaching
> -----------------------------------------------
>
>                 Key: HDFS-5394
>                 URL: https://issues.apache.org/jira/browse/HDFS-5394
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-5394-caching.001.patch, HDFS-5394-caching.002.patch, HDFS-5394-caching.003.patch,
HDFS-5394-caching.004.patch, HDFS-5394.005.patch, HDFS-5394.006.patch
>
>
> The DN needs to handle situations where it is asked to cache the same replica more than
once.  (Currently, it can actually do two mmaps and mlocks.)  It also needs to handle the
situation where caching a replica is cancelled before said caching completes.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Mime
View raw message