Date: Mon, 10 Nov 2014 07:29:33 +0000 (UTC)
From: "jiangyu (JIRA)"
To: hdfs-dev@hadoop.apache.org
Subject: [jira] [Created] (HDFS-7385) ThreadLocal used in FSEditLog class lead FSImage permission mess up

jiangyu created HDFS-7385:
-----------------------------

             Summary: ThreadLocal used in FSEditLog class lead FSImage permission mess up
                 Key: HDFS-7385
                 URL: https://issues.apache.org/jira/browse/HDFS-7385
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.5.0, 2.4.0
            Reporter: jiangyu

Last week we migrated our NameNodes from low-configuration machines to high-configuration machines.
First, we copied the current directory, including the fsimage and edit log files, from the original active NameNode to the new active NameNode and started the new NameNode. We then changed the configuration of all DataNodes and restarted them, so that they sent block reports to the new NameNodes at once and heartbeats after that.

Everything seemed fine, but after we restarted the ResourceManager, most users complained that their jobs could not be executed because of permission problems.

We use ACLs in our clusters, and after the migration we found that most of the directories and files that had never had ACLs set now carried ACLs. That is why users could not execute their jobs. We had to change the permissions of most files to a+r and of most directories to a+rx to make sure the jobs could be executed.

After investigating this problem for some days, I found a bug in FSEditLog.java. The ThreadLocal variable cache in FSEditLog is not set to the proper value in the logMkDir and logOpenFile functions. Here is the code of logMkDir:

  public void logMkDir(String path, INode newNode) {
    PermissionStatus permissions = newNode.getPermissionStatus();
    MkdirOp op = MkdirOp.getInstance(cache.get())
      .setInodeId(newNode.getId())
      .setPath(path)
      .setTimestamp(newNode.getModificationTime())
      .setPermissionStatus(permissions);

    AclFeature f = newNode.getAclFeature();
    if (f != null) {
      op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
    }
    logEdit(op);
  }

For example, if we mkdir with ACLs through one handler (which is a thread), the AclEntries are set on the op taken from the cache. If a later mkdir without any ACLs goes through the same handler, the AclEntries in the cached op are still the ones from the previous mkdir, and because newNode has no AclFeature, nothing ever overwrites them. The edit log is then wrong: it records ACLs that were never set on the new directory.
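The stale-field behaviour described above can be reproduced outside HDFS with a small, self-contained sketch. Everything in it is hypothetical and only mirrors the shape of the problem: the class StaleOpCacheDemo, the Op class (a stand-in for MkdirOp), and the reset() method are illustration names, not the HDFS API, and clearing the fields before reuse is just one possible way to avoid the problem, not necessarily the fix that should go into FSEditLog.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StaleOpCacheDemo {

    /** Hypothetical stand-in for an edit log op such as MkdirOp. */
    static final class Op {
        String path;
        List<String> aclEntries; // null means "no ACL recorded"

        Op setPath(String path) { this.path = path; return this; }
        Op setAclEntries(List<String> entries) { this.aclEntries = entries; return this; }

        /** Clears every field so nothing survives from a previous use. */
        Op reset() { this.path = null; this.aclEntries = null; return this; }
    }

    // One reusable Op per thread, like the per-handler cache in the report.
    private static final ThreadLocal<Op> CACHE = ThreadLocal.withInitial(Op::new);

    /** Buggy variant: aclEntries is only written when the new node has an ACL. */
    public static List<String> logMkdirBuggy(String path, List<String> acl) {
        Op op = CACHE.get().setPath(path);
        if (acl != null) {
            op.setAclEntries(acl);
        }
        return recordedAcl(op);
    }

    /** Possible remedy (assumption): clear the cached op before filling it. */
    public static List<String> logMkdirFixed(String path, List<String> acl) {
        Op op = CACHE.get().reset().setPath(path);
        if (acl != null) {
            op.setAclEntries(acl);
        }
        return recordedAcl(op);
    }

    /** Returns a copy of what would be written to the edit log for this op. */
    private static List<String> recordedAcl(Op op) {
        return op.aclEntries == null ? null : new ArrayList<>(op.aclEntries);
    }

    public static void main(String[] args) {
        List<String> acl = Arrays.asList("user:alice:rwx");

        logMkdirBuggy("/with-acl", acl);
        // Same thread, no ACL requested, yet the previous ACL is still recorded.
        System.out.println("buggy: " + logMkdirBuggy("/no-acl", null));

        logMkdirFixed("/with-acl", acl);
        // After reset() the second op carries no ACL.
        System.out.println("fixed: " + logMkdirFixed("/no-acl", null));
    }
}
```

Both calls in each pair run on the same thread, which is what a single NameNode handler does when it serves two mkdir requests one after the other.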

After the standby NameNode (SNN) loads the edit logs from the JournalNodes, applies them in memory, runs saveNamespace, and transfers the wrong fsimage to the active NameNode (ANN), all the fsimages are wrong. The only solution is to save the namespace from the ANN, which produces a correct fsimage.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)