hadoop-hdfs-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-7385) ThreadLocal used in FSEditLog class lead FSImage permission mess up
Date Thu, 13 Nov 2014 17:12:34 GMT

     [ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris Nauroth updated HDFS-7385:
--------------------------------
    Status: Patch Available  (was: Open)

> ThreadLocal used in FSEditLog class  lead FSImage permission mess up
> --------------------------------------------------------------------
>
>                 Key: HDFS-7385
>                 URL: https://issues.apache.org/jira/browse/HDFS-7385
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.5.0, 2.4.0
>            Reporter: jiangyu
>            Assignee: jiangyu
>            Priority: Critical
>         Attachments: HDFS-7385.2.patch, HDFS-7385.patch
>
>
>       We migrated our NameNodes from low-configuration to high-configuration machines
last week. First, we imported the current directory, including the fsimage and editlog files,
from the original Active NameNode to the new Active NameNode and started the new NameNode. Then we changed
the configuration of all DataNodes and restarted them; they sent block reports to the new
NameNodes at once and sent heartbeats after that.
>        Everything seemed perfect, but after we restarted the ResourceManager, most of the
users complained that their jobs could not be executed because of permission problems.
>       We use ACLs in our clusters, and after the migration we found that most of the directories
and files that had no ACLs set before now had ACL properties. That is the reason
why users could not execute their jobs. So we had to change the permissions of most files to
a+r and of most directories to a+rx to make sure the jobs could be executed.
> After investigating this problem for some days, I found a bug in FSEditLog.java.
The ThreadLocal variable cache in FSEditLog is not set to the proper value in the logMkDir and logOpenFile
functions. Here is the code of logMkDir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>       For example, if we mkdir with ACLs through one handler (a thread, in fact), we set the
AclEntries on the op taken from the cache. If we then mkdir without any ACLs
through the same handler, the AclEntries on the cached op are still those from the previous call,
and because the newNode has no AclFeature, nothing gets a chance to change
them. The edit log is then wrong: it records the wrong ACLs. After the Standby NameNode (SNN) loads the edit logs from
the JournalNodes, applies them in memory, saves the namespace, and transfers the wrong fsimage
to the Active NameNode (ANN), all the fsimages are wrong. The only solution is to save the namespace from the ANN, which
gives you the right fsimage.
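
The reuse problem described above can be reproduced outside Hadoop with a small, self-contained sketch. Everything in it is hypothetical and only for illustration: the class ThreadLocalOpReuseSketch, the type MkdirOpSketch, and the methods logMkDirBuggy, logMkDirFixed, and reset are not Hadoop APIs, and ACL entries are reduced to plain strings. The "fixed" variant shows one possible remedy (clearing the cached op before reuse); it is an assumption and not necessarily what the attached patches do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ThreadLocalOpReuseSketch {

    /** Hypothetical stand-in for an edit-log op; not the real MkdirOp. */
    static final class MkdirOpSketch {
        String path;
        List<String> aclEntries; // null means "no ACL recorded"

        MkdirOpSketch setPath(String path) { this.path = path; return this; }
        MkdirOpSketch setAclEntries(List<String> entries) { this.aclEntries = entries; return this; }

        /** Clears every field so nothing leaks from the previous use. */
        void reset() { path = null; aclEntries = null; }
    }

    // One reusable op per thread, mirroring the per-handler cache in the report.
    private static final ThreadLocal<MkdirOpSketch> CACHE =
        ThreadLocal.withInitial(MkdirOpSketch::new);

    /** Copy of the ACL entries that would be written to the edit log. */
    private static List<String> recordedAcl(MkdirOpSketch op) {
        return op.aclEntries == null
            ? Collections.<String>emptyList()
            : Collections.unmodifiableList(new ArrayList<>(op.aclEntries));
    }

    /** Buggy pattern: the ACL field is only written when the new node has an ACL. */
    static List<String> logMkDirBuggy(String path, List<String> acl) {
        MkdirOpSketch op = CACHE.get().setPath(path);
        if (acl != null) {
            op.setAclEntries(acl);
        }
        return recordedAcl(op); // may still hold entries from an earlier call
    }

    /** Assumed remedy: reset the cached op before populating it. */
    static List<String> logMkDirFixed(String path, List<String> acl) {
        MkdirOpSketch op = CACHE.get();
        op.reset();
        op.setPath(path);
        if (acl != null) {
            op.setAclEntries(acl);
        }
        return recordedAcl(op);
    }

    public static void main(String[] args) {
        List<String> acl = Arrays.asList("user:alice:rwx");
        System.out.println("buggy, with ACL:    " + logMkDirBuggy("/a", acl));
        System.out.println("buggy, without ACL: " + logMkDirBuggy("/b", null));
        System.out.println("fixed, without ACL: " + logMkDirFixed("/c", null));
    }
}
```

In the buggy variant, the second call on the same thread reports the ACL entries of the first call even though no ACL was passed, which matches the symptom in the report: directories created without ACLs are logged with someone else's ACLs.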



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
