hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-6618) Remove deleted INodes from INodeMap right away
Date Thu, 03 Jul 2014 20:40:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051887#comment-14051887 ]

Kihwal Lee edited comment on HDFS-6618 at 7/3/14 8:39 PM:
----------------------------------------------------------

[~cmccabe], thanks for the review.
bq. What happens if a QuotaExceededException is thrown here .....
This is indeed problematic, but it is also the case for the existing code and for what you are suggesting.
If an exception is thrown in the middle of a delete, the partial delete is not undone. The
inode at the top of the tree being deleted, and potentially more, will already have been unlinked,
and the rest will remain linked but unreachable. If inodes are removed altogether at the
end, none of the inodes will be removed from the inodeMap when an exception is thrown. This will
cause inodes and blocks to leak. If we remove inodes as we go, at least some inodes will
get removed in the same situation. Either way things will leak, but to a lesser degree in
the latter case. I wouldn't say the latter is superior because of this difference; I am
just saying it's no worse.
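The difference between the two failure modes can be sketched with a toy model (the class and method names below are made up for illustration; this is not the real FSDirectory code). When the traversal fails halfway, inline removal keeps the entries removed so far, while deferred removal abandons the whole collected list:

```java
import java.util.*;

// Toy contrast: removing map entries inline during a tree delete vs.
// deferring removal to the end, when an exception interrupts halfway.
public class DeleteLeakSketch {
    static Map<Integer, String> inodeMap(int n) {
        Map<Integer, String> m = new HashMap<>();
        for (int i = 0; i < n; i++) m.put(i, "inode-" + i);
        return m;
    }

    // Inline removal: entries removed before the failure stay removed.
    public static Map<Integer, String> deleteInline(int n, int failAt) {
        Map<Integer, String> map = inodeMap(n);
        try {
            for (int i = 0; i < n; i++) {
                if (i == failAt) throw new IllegalStateException("quota exceeded");
                map.remove(i);  // removed as we go
            }
        } catch (IllegalStateException ignored) { }
        return map;
    }

    // Deferred removal: the collected list is abandoned on failure,
    // so nothing is removed and every unlinked inode leaks.
    public static Map<Integer, String> deleteDeferred(int n, int failAt) {
        Map<Integer, String> map = inodeMap(n);
        List<Integer> toRemove = new ArrayList<>();
        try {
            for (int i = 0; i < n; i++) {
                if (i == failAt) throw new IllegalStateException("quota exceeded");
                toRemove.add(i);  // collected, removed only at the end
            }
            for (int i : toRemove) map.remove(i);
        } catch (IllegalStateException ignored) { }
        return map;
    }

    public static void main(String[] args) {
        // Failure at the 6th inode of 10: inline leaks 5, deferred leaks all 10.
        System.out.println(deleteInline(10, 5).size());
        System.out.println(deleteDeferred(10, 5).size());
    }
}
```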

One of the key motivations for removing inodes inline was to avoid the overhead of building up a large
data structure when deleting a large tree. Although the list is now backed by {{ChunkedArrayList}},
there will still be a lot of allocation and quite a bit of memory consumption. All or part of it may
be promoted and remain in the heap until the next old-gen collection. This might be acceptable
if we were doing the deferred removal outside the lock. But since we are trying to do it inside
both the FSNamesystem and FSDirectory locks, building the list is just a waste.
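For context, the chunked-list idea can be sketched as follows (a minimal stand-in, not Hadoop's actual {{ChunkedArrayList}}): growth allocates new fixed-size chunks instead of copy-resizing one backing array, but all the chunks still occupy heap until the whole list is dropped, which is exactly the cost of using it as a delete-time scratchpad inside the lock.

```java
import java.util.*;

// Minimal chunked-list sketch: appends fill fixed-size chunks, so growth
// allocates a new chunk but never copies existing elements the way
// ArrayList's resize does. The memory for every chunk still accumulates
// until the whole list becomes garbage.
public class ChunkedListSketch<T> {
    private static final int CHUNK = 4;  // tiny for illustration
    private final List<List<T>> chunks = new ArrayList<>();
    private int size = 0;

    public void add(T item) {
        if (chunks.isEmpty() || chunks.get(chunks.size() - 1).size() == CHUNK) {
            chunks.add(new ArrayList<>(CHUNK));  // new chunk, no copy of old data
        }
        chunks.get(chunks.size() - 1).add(item);
        size++;
    }

    public int size() { return size; }
    public int chunkCount() { return chunks.size(); }

    public static void main(String[] args) {
        ChunkedListSketch<Integer> list = new ChunkedListSketch<>();
        for (int i = 0; i < 10; i++) list.add(i);
        // 10 elements end up spread across chunks of 4, 4, and 2.
        System.out.println(list.size());
        System.out.println(list.chunkCount());
    }
}
```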

About leaking inodes and blocks: 

- Inodes were removed from the inodeMap but blocks weren't. This includes adding a block after
the inode is deleted, due to the delete-addBlock race. Since the block is not removed from the
blocksMap and still holds a reference to its block collection (i.e. the inode), the block
will look valid to {{BlockManager}}. This causes a memory leak, which disappears when the
namenode is restarted.

- Unlinked/deleted inodes were not removed from the inodeMap. The deleted inodes will remain
in memory. If their blocks were also not removed from the blocksMap, the blocks will remain in memory
too. If the blocks were collected but not removed from the blocksMap, they will disappear after a restart.
When saving the fsimage, the orphaned inodes will be saved in the inode section. The way the
INodeDirectorySection is saved also causes all leaked (still linked) children and blocks to be saved,
so when the fsimage is loaded, the leak is recreated in memory.
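The first leak above can be sketched with toy classes (Inode/Block here are made-up stand-ins, not the real org.apache.hadoop.hdfs types): a block left in the blocksMap with a non-null block-collection reference passes a BlockManager-style validity check even though its inode has been deleted from the namespace.

```java
import java.util.*;

// Toy model of the leak: a block never removed from blocksMap, still
// referencing its (deleted) inode, is indistinguishable from a live
// block by a "present in blocksMap and block collection non-null" check.
public class BlockLeakSketch {
    public static class Inode {
        final String name;
        public Inode(String n) { name = n; }
    }

    public static class Block {
        final long id;
        public Inode blockCollection;  // back-reference to the owning inode
        public Block(long id, Inode bc) { this.id = id; this.blockCollection = bc; }
    }

    // "Valid" in the BlockManager sense modeled here.
    public static boolean looksValid(Map<Long, Block> blocksMap, Block b) {
        return blocksMap.containsKey(b.id) && b.blockCollection != null;
    }

    public static void main(String[] args) {
        Map<Long, Block> blocksMap = new HashMap<>();
        Block blk = new Block(1L, new Inode("file"));
        blocksMap.put(blk.id, blk);

        // Leak scenario: the inode is deleted from the namespace, but the
        // block keeps its back-reference and stays in blocksMap, so it
        // still looks valid.
        System.out.println(looksValid(blocksMap, blk));

        // A complete delete clears the reference and removes the entry,
        // and the block stops looking valid.
        blk.blockCollection = null;
        blocksMap.remove(blk.id);
        System.out.println(looksValid(blocksMap, blk));
    }
}
```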

I am a bit depressed after writing this. Let's fix things one at a time...



> Remove deleted INodes from INodeMap right away
> ----------------------------------------------
>
>                 Key: HDFS-6618
>                 URL: https://issues.apache.org/jira/browse/HDFS-6618
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.5.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Blocker
>         Attachments: HDFS-6618.AbstractList.patch, HDFS-6618.inodeRemover.patch, HDFS-6618.inodeRemover.v2.patch,
HDFS-6618.patch
>
>
> After HDFS-6527, we had not seen the edit log corruption for weeks on multiple clusters,
until yesterday. Previously, we would see it within 30 minutes on a cluster.
> But the same condition was reproduced even with HDFS-6527.  The only explanation is that
the RPC handler thread serving {{addBlock()}} was accessing a stale parent value.  Although
nulling out parent is done inside the {{FSNamesystem}} and {{FSDirectory}} write locks, there
is no memory barrier because there is no "synchronized" block involved in the process.
> I suggest making parent volatile.
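The suggestion above can be illustrated with a toy sketch (INodeStub is a made-up stand-in, not the real HDFS INode class). Under the Java memory model, a write to a plain field made under one lock is not guaranteed visible to a reader that does not synchronize on the same monitor; declaring the field volatile establishes the needed happens-before edge:

```java
// Toy illustration of the visibility fix: without volatile (or a shared
// synchronized block), an addBlock() handler thread may keep seeing a
// stale non-null parent after the delete path nulled it out under an
// unrelated lock. The volatile keyword supplies the memory barrier.
public class INodeStub {
    private volatile INodeStub parent;  // volatile: cross-thread visibility

    public void setParent(INodeStub p) { parent = p; }
    public INodeStub getParent() { return parent; }

    // A deleted inode has been unlinked, i.e. parent is null.
    public boolean isDeleted() { return parent == null; }

    public static void main(String[] args) {
        INodeStub root = new INodeStub();
        INodeStub child = new INodeStub();
        child.setParent(root);
        System.out.println(child.isDeleted());
        child.setParent(null);  // delete unlinks the child
        System.out.println(child.isDeleted());
    }
}
```

Note the single-threaded demo above only exercises the field; the volatile modifier matters for the concurrent case, where it guarantees the null written by the deleting thread is seen by the RPC handler thread.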



--
This message was sent by Atlassian JIRA
(v6.2#6252)
