hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7496) Fix FsVolume removal race conditions on the DataNode
Date Mon, 08 Dec 2014 23:02:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238621#comment-14238621 ]

Colin Patrick McCabe commented on HDFS-7496:
--------------------------------------------

So, FsVolume removal can happen because of DN reconfiguration (HDFS-6727), or because a failure
was detected in {{FsVolumeList#checkDirs}} (see HDFS-7489 for more discussion).  While we
can prevent certain race conditions by locking the {{FsVolumeList}} object itself, other race
conditions are more fundamental.  For example, if someone calls {{FsVolumeList#getNextVolume}},
the volume instance they get back may be removed before they use it, or even while they are
using it.

We can't fix this with a "big lock" unless we lock all operations which use volumes, which
seems unreasonable.  We could fix this in a few different ways.  We could do explicit reference
counting.  This is a bit tricky because someone might forget to unreference the volume after
using it, at which point the leak behaves much like a file descriptor leak: the volume can
never be fully removed.  Another way would be to use Java's {{PhantomReference}} mechanism to
determine when the {{FsVolumeImpl}} objects are no longer being referenced.
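
To make the explicit reference counting option concrete, here is a minimal sketch.  Everything
in it is hypothetical ({{Volume}}, {{VolumeRef}}, {{tryAcquire}}, {{remove}}); it is not the
existing {{FsVolumeImpl}} API, just one way the idea could look.  Making the handle
{{Closeable}} lets callers use try-with-resources, which reduces (but does not eliminate) the
risk of forgetting to unreference.

```java
import java.io.Closeable;

// Hypothetical names throughout; this is not the DataNode's FsVolumeImpl.
final class VolumeRefCountSketch {

  /** A volume whose removal is deferred until every user has released it. */
  static final class Volume {
    private final String storageId;
    private int refCount = 0;          // guarded by "this"
    private boolean removing = false;  // guarded by "this"

    Volume(String storageId) {
      this.storageId = storageId;
    }

    /** Returns a handle, or null if removal has already been requested. */
    synchronized VolumeRef tryAcquire() {
      if (removing) {
        return null;
      }
      refCount++;
      return new VolumeRef(this);
    }

    private synchronized void release() {
      refCount--;
      if (refCount == 0) {
        notifyAll();  // wake a remover waiting for in-flight users
      }
    }

    /** Rejects new users, then blocks until in-flight users finish. */
    synchronized void remove() throws InterruptedException {
      removing = true;
      while (refCount > 0) {
        wait();
      }
    }

    synchronized int refCount() {
      return refCount;
    }
  }

  /** Closeable handle so that try-with-resources drops the reference. */
  static final class VolumeRef implements Closeable {
    private final Volume volume;
    private boolean closed = false;

    VolumeRef(Volume volume) {
      this.volume = volume;
    }

    Volume volume() {
      return volume;
    }

    @Override
    public synchronized void close() {
      if (!closed) {  // a double close must not decrement twice
        closed = true;
        volume.release();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Volume vol = new Volume("DS-example");
    try (VolumeRef ref = vol.tryAcquire()) {
      System.out.println("in use, refCount=" + vol.refCount());
    }
    vol.remove();  // returns immediately: no references are outstanding
    System.out.println("after remove, acquire returns " + vol.tryAcquire());
  }
}
```

The trade-off is the one described above: a caller that takes a handle and never closes it
would make {{remove}} block forever in this sketch, so a real version would probably want a
timeout or some leak detection.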

A related point is that we often refer to volumes by their base path.  But actually, we could
destroy a volume and re-create another volume with the same base path.  This leads to a lot
of subtle races.  To solve this, we could try to start using storageIDs more heavily, because
they are globally unique.  I'm not sure whether there is any other good solution to this.
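
As an illustration of the base path problem, the sketch below removes a volume and re-creates
one at the same path.  The {{Registry}} and {{Volume}} classes and the way the IDs are
generated are assumptions made up for the example, not DataNode code.  A caller that
remembered only the base path silently picks up the new volume, while a caller that remembered
the storage ID gets a clean miss.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical names throughout; this only illustrates path reuse.
final class StorageIdLookupSketch {

  static final class Volume {
    final String storageId;
    final String basePath;

    Volume(String storageId, String basePath) {
      this.storageId = storageId;
      this.basePath = basePath;
    }
  }

  static final class Registry {
    private final Map<String, Volume> byStorageId = new HashMap<>();

    /** Creates a volume with a fresh, unique ID (ID format is illustrative). */
    synchronized Volume add(String basePath) {
      Volume v = new Volume("DS-" + UUID.randomUUID(), basePath);
      byStorageId.put(v.storageId, v);
      return v;
    }

    synchronized void remove(String storageId) {
      byStorageId.remove(storageId);
    }

    /** Returns null once the volume with this ID has been removed. */
    synchronized Volume findByStorageId(String storageId) {
      return byStorageId.get(storageId);
    }

    /** Cannot tell a re-created volume apart from the original one. */
    synchronized Volume findByBasePath(String basePath) {
      for (Volume v : byStorageId.values()) {
        if (v.basePath.equals(basePath)) {
          return v;
        }
      }
      return null;
    }
  }

  public static void main(String[] args) {
    Registry registry = new Registry();
    Volume original = registry.add("/data/1");

    // The volume is removed and another is created at the same base path.
    registry.remove(original.storageId);
    Volume recreated = registry.add("/data/1");

    // A stale path lookup quietly resolves to the new volume.
    System.out.println("path lookup hits new volume: "
        + (registry.findByBasePath("/data/1") == recreated));
    // A stale storage ID lookup fails, so the caller can notice the removal.
    System.out.println("stale ID lookup: "
        + registry.findByStorageId(original.storageId));
  }
}
```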

> Fix FsVolume removal race conditions on the DataNode 
> -----------------------------------------------------
>
>                 Key: HDFS-7496
>                 URL: https://issues.apache.org/jira/browse/HDFS-7496
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>
> We discussed a few FsVolume removal race conditions on the DataNode in HDFS-7489.  We
> should figure out a way to make removing an FsVolume safe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
