Date: Thu, 29 Nov 2012 23:26:59 +0000 (UTC)
From: "Todd Lipcon (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-9103) UTF8 class does not properly decode Unicode characters outside the basic multilingual plane

    [ https://issues.apache.org/jira/browse/HADOOP-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506933#comment-13506933 ]

Todd Lipcon commented on HADOOP-9103:
-------------------------------------

bq. Rather than adding a comment saying "this code is buggy", how about we fix the bug? Outputting proper 4-byte UTF8 sequences for a given UTF-16 surrogate pair is a much better solution than the current behavior.

It's not "buggy", it's just "different" (reminds me of something my elementary school teachers used to say). But on a serious note, yeah, what Colin said above: it could break existing clients of the code who use the old code to _decode_, and who rely on the fact that we can round-trip non-BMP characters through UTF8.java.
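To make the round-trip concern concrete, here is a minimal, self-contained sketch. It uses java.io's modified UTF-8 (DataOutput.writeUTF) as a stand-in for the old UTF8.java wire format; that stand-in is an assumption for illustration, not the actual UTF8.java codec. A character outside the BMP becomes one 4-byte sequence under standard UTF-8 but two 3-byte surrogate sequences under modified UTF-8, and a modified-UTF-8 decoder rejects the 4-byte form outright:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.io.UTFDataFormatException;
    import java.nio.charset.StandardCharsets;

    public class NonBmpRoundTrip {
        public static void main(String[] args) throws IOException {
            String s = "\uD83D\uDE00"; // U+1F600, one code point outside the BMP

            // Standard UTF-8 encodes the supplementary code point as one 4-byte sequence.
            byte[] standard = s.getBytes(StandardCharsets.UTF_8);
            System.out.println("standard UTF-8 bytes: " + standard.length); // 4

            // Modified UTF-8 (DataOutput.writeUTF) encodes each UTF-16 surrogate
            // as its own 3-byte sequence: 6 payload bytes plus a 2-byte length prefix.
            ByteArrayOutputStream modified = new ByteArrayOutputStream();
            new DataOutputStream(modified).writeUTF(s);
            System.out.println("modified UTF-8 bytes: " + (modified.size() - 2)); // 6

            // A decoder that only understands the modified form rejects the 4-byte
            // sequence, so "fixing" only the encoder breaks old readers.
            ByteArrayOutputStream framed = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(framed);
            out.writeShort(standard.length); // readUTF expects a length prefix
            out.write(standard);
            try {
                new DataInputStream(
                    new ByteArrayInputStream(framed.toByteArray())).readUTF();
            } catch (UTFDataFormatException e) {
                System.out.println("old-style decoder rejected 4-byte UTF-8: " + e.getMessage());
            }
        }
    }

That failure mode is the compatibility risk being weighed here: data already written in the old format, and old decoders still in the field, must keep working.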
> UTF8 class does not properly decode Unicode characters outside the basic multilingual plane
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-9103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9103
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.20.1
>         Environment: SUSE LINUX
>            Reporter: yixiaohua
>            Assignee: Todd Lipcon
>         Attachments: FSImage.java, hadoop-9103.txt, hadoop-9103.txt, hadoop-9103.txt, ProblemString.txt, TestUTF8AndStringGetBytes.java, TestUTF8AndStringGetBytes.java
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> This is the log of the exception from the SecondaryNameNode:
> 2012-03-28 00:48:42,553 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.io.IOException: Found lease for
> non-existent file /user/boss/pgv/fission/task16/split/_temporary/_attempt_201203271849_0016_r_000174_0/????@???????????????
> ??????????tor.qzone.qq.com/keypart-00174
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFilesUnderConstruction(FSImage.java:1211)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:959)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:589)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:473)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:350)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:314)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
>         at java.lang.Thread.run(Thread.java:619)
> This is the log about the same file from the NameNode:
> 2012-03-28 00:32:26,528 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=boss,boss ip=/10.131.16.34 cmd=create src=/user/boss/pgv/fission/task16/split/_temporary/_attempt_201203271849_0016_r_000174_0/ @? tor.qzone.qq.com/keypart-00174 dst=null perm=boss:boss:rw-r--r--
> 2012-03-28 00:37:42,387 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /user/boss/pgv/fission/task16/split/_temporary/_attempt_201203271849_0016_r_000174_0/ @? tor.qzone.qq.com/keypart-00174. blk_2751836614265659170_184668759
> 2012-03-28 00:37:42,696 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /user/boss/pgv/fission/task16/split/_temporary/_attempt_201203271849_0016_r_000174_0/ @? tor.qzone.qq.com/keypart-00174 is closed by DFSClient_attempt_201203271849_0016_r_000174_0
> 2012-03-28 00:37:50,315 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=boss,boss ip=/10.131.16.34 cmd=rename src=/user/boss/pgv/fission/task16/split/_temporary/_attempt_201203271849_0016_r_000174_0/ @? tor.qzone.qq.com/keypart-00174 dst=/user/boss/pgv/fission/task16/split/ @? tor.qzone.qq.com/keypart-00174 perm=boss:boss:rw-r--r--
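As an aside on why the path components print as runs of "?": if bytes produced by a surrogate-pair encoder (modified UTF-8 / CESU-8 style) are later decoded as strict standard UTF-8, the surrogate-range sequences are invalid and come out as replacement characters. A minimal, hypothetical sketch; the bytes below are illustrative for U+1F600, not bytes recovered from this image:

    import java.nio.charset.StandardCharsets;

    public class GarbledName {
        public static void main(String[] args) {
            // ED A0 BD ED B8 80 is U+1F600 in CESU-8/modified UTF-8 form:
            // two 3-byte sequences, one per UTF-16 surrogate.
            byte[] surrogateForm = {
                (byte) 0xED, (byte) 0xA0, (byte) 0xBD,
                (byte) 0xED, (byte) 0xB8, (byte) 0x80
            };
            // Surrogate code points are not legal in standard UTF-8, so a strict
            // decoder substitutes them, yielding a run of U+FFFD replacement
            // characters reminiscent of the mangled names in the logs above.
            String decoded = new String(surrogateForm, StandardCharsets.UTF_8);
            System.out.println(decoded);
        }
    }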
> After checking the code that saves the FSImage, I found a problem that may be a bug in the HDFS code. I paste it below:
> -------------this is the saveFSImage method in FSImage.java; I marked the problem code------------
> /**
>  * Save the contents of the FS image to the file.
>  */
> void saveFSImage(File newFile) throws IOException {
>   FSNamesystem fsNamesys = FSNamesystem.getFSNamesystem();
>   FSDirectory fsDir = fsNamesys.dir;
>   long startTime = FSNamesystem.now();
>   //
>   // Write out data
>   //
>   DataOutputStream out = new DataOutputStream(
>       new BufferedOutputStream(
>           new FileOutputStream(newFile)));
>   try {
>     .........
>     // save the rest of the nodes
>     saveImage(strbuf, 0, fsDir.rootDir, out); // ------------------ problem
>     fsNamesys.saveFilesUnderConstruction(out); // ------------------ problem, detail below
>     strbuf = null;
>   } finally {
>     out.close();
>   }
>   LOG.info("Image file of size " + newFile.length() + " saved in "
>       + (FSNamesystem.now() - startTime)/1000 + " seconds.");
> }
>
> /**
>  * Save file tree image starting from the given root.
>  * This is a recursive procedure, which first saves all children of
>  * a current directory and then moves inside the sub-directories.
>  */
> private static void saveImage(ByteBuffer parentPrefix,
>                               int prefixLength,
>                               INodeDirectory current,
>                               DataOutputStream out) throws IOException {
>   int newPrefixLength = prefixLength;
>   if (current.getChildrenRaw() == null)
>     return;
>   for (INode child : current.getChildren()) {
>     // print all children first
>     parentPrefix.position(prefixLength);
>     parentPrefix.put(PATH_SEPARATOR).put(child.getLocalNameBytes()); // ------------------ problem
>     saveINode2Image(parentPrefix, child, out);
>   }
>   ..........
> }
>
> // Helper function that writes an INodeUnderConstruction
> // into the output stream
> static void writeINodeUnderConstruction(DataOutputStream out,
>                                         INodeFileUnderConstruction cons,
>                                         String path)
>     throws IOException {
>   writeString(path, out); // ------------------ problem
>   ..........
> }
>
> static private final UTF8 U_STR = new UTF8();
>
> static void writeString(String str, DataOutputStream out) throws IOException {
>   U_STR.set(str);
>   U_STR.write(out); // ------------------ problem
> }
>
> /**
>  * Converts a string to a byte array using UTF8 encoding.
>  */
> static byte[] string2Bytes(String str) {
>   try {
>     return str.getBytes("UTF8"); // ------------------ problem
>   } catch (UnsupportedEncodingException e) {
>     assert false : "UTF8 encoding is not supported ";
>   }
>   return null;
> }
> ------------------------------------------below is the explanation------------------------
> In the saveImage method, child.getLocalNameBytes() produces its bytes via
> str.getBytes("UTF8"), but in writeINodeUnderConstruction the bytes are produced
> by the UTF8 class. I ran a test with our garbled file name and found that the
> two byte arrays are not equal. When I used the UTF8 class in both places, the
> problem disappeared.
> I think this is a bug in HDFS or in UTF8.
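A minimal reproduction of the mismatch described above, in the spirit of the attached TestUTF8AndStringGetBytes.java. This sketch uses DataOutput.writeUTF as a stand-in modified-UTF-8 encoder rather than the actual org.apache.hadoop.io.UTF8 class, and the class and variable names are hypothetical:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class EncoderMismatch {
        public static void main(String[] args) throws IOException {
            // A path component containing a non-BMP character (U+1F600).
            String name = "keypart-\uD83D\uDE00";

            // Path 1: what string2Bytes / child.getLocalNameBytes() does.
            byte[] viaGetBytes = name.getBytes(StandardCharsets.UTF_8);

            // Path 2: a modified-UTF-8 encoder, standing in for UTF8.write();
            // strip writeUTF's 2-byte length prefix to compare payloads only.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(name);
            byte[] viaUtf8Style = Arrays.copyOfRange(buf.toByteArray(), 2, buf.size());

            // Equal for BMP-only names, unequal once a surrogate pair appears.
            System.out.println(Arrays.equals(viaGetBytes, viaUtf8Style)); // false
        }
    }

Writing a name through one path and reading it back through the other is exactly the save/load asymmetry the reporter hit in the FSImage.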