hadoop-hdfs-dev mailing list archives

From "jiangyu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-7385) ThreadLocal used in FSEditLog class lead FSImage permission mess up
Date Mon, 10 Nov 2014 07:29:33 GMT
jiangyu created HDFS-7385:
-----------------------------

             Summary: ThreadLocal used in FSEditLog class  lead FSImage permission mess up
                 Key: HDFS-7385
                 URL: https://issues.apache.org/jira/browse/HDFS-7385
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.5.0, 2.4.0
            Reporter: jiangyu


      We migrated our NameNodes from low-spec to high-spec machines last week. First, we
copied the current directory, including the fsimage and editlog files, from the original
Active NameNode to the new Active NameNode and started the new NameNode. We then changed
the configuration of all DataNodes and restarted them; they sent block reports to the new
NameNodes right away and heartbeats after that.
      Everything seemed fine, but after we restarted the ResourceManager, most users
complained that their jobs could not run because of permission problems.
      We use ACLs in our clusters. After the migration, we found that most of the directories
and files that had never had ACLs set now carried ACLs. That is why users could not run
their jobs. We had to change most file permissions to a+r and directory permissions to
a+rx so that the jobs could run.
      After investigating this problem for some days, I found a bug in FSEditLog.java. The
ThreadLocal variable cache in FSEditLog is not reset to the proper value in the logMkDir and
logOpenFile functions. Here is the code of logMkDir:
  public void logMkDir(String path, INode newNode) {
    PermissionStatus permissions = newNode.getPermissionStatus();
    MkdirOp op = MkdirOp.getInstance(cache.get())
      .setInodeId(newNode.getId())
      .setPath(path)
      .setTimestamp(newNode.getModificationTime())
      .setPermissionStatus(permissions);

    AclFeature f = newNode.getAclFeature();
    if (f != null) {
      op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
    }
    logEdit(op);
  }
      For example, if we mkdir with ACLs through one handler (a thread, in fact), we set the
AclEntries on the op taken from the cache. If the same handler later performs a mkdir without
any ACLs, the AclEntries on the cached op are still those from the previous call, and because
the newNode has no AclFeature, nothing overwrites them. The edit log then records the wrong
ACLs. After the Standby NameNode (SNN) loads the edit logs from the JournalNodes, applies
them in memory, runs saveNamespace, and transfers the resulting fsimage to the Active
NameNode (ANN), all the fsimages are wrong. The only solution is to save the namespace from
the ANN, which produces a correct fsimage.
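
      The reuse problem described above can be reproduced outside Hadoop with a small
standalone sketch. Everything in it is hypothetical: MkdirOpSketch, logMkDirBuggy and
logMkDirReset are illustrative stand-ins, not the real FSEditLog or MkdirOp API, and
clearing the field before reuse is only one possible fix, not necessarily the one the
project will adopt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThreadLocalOpReuseSketch {

    /** Hypothetical stand-in for MkdirOp: a mutable op object reused by one thread. */
    static final class MkdirOpSketch {
        String path;
        List<String> aclEntries; // null means "no ACL recorded"

        MkdirOpSketch setPath(String path) { this.path = path; return this; }
        MkdirOpSketch setAclEntries(List<String> entries) { this.aclEntries = entries; return this; }
    }

    // One cached op per handler thread, mirroring the ThreadLocal cache in the report.
    private static final ThreadLocal<MkdirOpSketch> CACHE =
            ThreadLocal.withInitial(MkdirOpSketch::new);

    /** Buggy pattern: aclEntries is overwritten only when the new directory has an ACL. */
    static List<String> logMkDirBuggy(String path, List<String> acl) {
        MkdirOpSketch op = CACHE.get().setPath(path);
        if (acl != null) {
            op.setAclEntries(acl);
        }
        return snapshot(op); // what would be written to the edit log
    }

    /** Assumed fix: clear the stale field before populating the reused op. */
    static List<String> logMkDirReset(String path, List<String> acl) {
        MkdirOpSketch op = CACHE.get().setPath(path).setAclEntries(null);
        if (acl != null) {
            op.setAclEntries(acl);
        }
        return snapshot(op);
    }

    // Copy the ACL list so callers see the value at logging time.
    private static List<String> snapshot(MkdirOpSketch op) {
        return op.aclEntries == null ? null : new ArrayList<>(op.aclEntries);
    }

    public static void main(String[] args) {
        List<String> acl = Arrays.asList("user:alice:rwx");

        logMkDirBuggy("/with-acl", acl);
        // Same thread, no ACL requested, yet the previous ACL leaks through.
        System.out.println(logMkDirBuggy("/no-acl", null)); // prints [user:alice:rwx]

        logMkDirReset("/with-acl-2", acl);
        System.out.println(logMkDirReset("/no-acl-2", null)); // prints null
    }
}
```

      The second call in each pair runs on the same thread as the first, which is the
situation of a single NameNode handler serving two consecutive mkdir requests.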




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
