Return-Path: 
Delivered-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org
Received: (qmail 61867 invoked from network); 27 Aug 2010 01:39:17 -0000
Received: from unknown (HELO mail.apache.org) (140.211.11.3)
    by 140.211.11.9 with SMTP; 27 Aug 2010 01:39:17 -0000
Received: (qmail 76580 invoked by uid 500); 27 Aug 2010 01:39:17 -0000
Delivered-To: apmail-hadoop-hdfs-issues-archive@hadoop.apache.org
Received: (qmail 76520 invoked by uid 500); 27 Aug 2010 01:39:16 -0000
Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help: 
List-Unsubscribe: 
List-Post: 
List-Id: 
Reply-To: hdfs-issues@hadoop.apache.org
Delivered-To: mailing list hdfs-issues@hadoop.apache.org
Received: (qmail 76512 invoked by uid 99); 27 Aug 2010 01:39:16 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136)
    by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 27 Aug 2010 01:39:16 +0000
X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED
X-Spam-Check-By: apache.org
Received: from [140.211.11.22] (HELO thor.apache.org) (140.211.11.22)
    by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 27 Aug 2010 01:39:16 +0000
Received: from thor (localhost [127.0.0.1])
    by thor.apache.org (8.13.8+Sun/8.13.8) with ESMTP id o7R1ctCR024912
    for ; Fri, 27 Aug 2010 01:38:55 GMT
Message-ID: <28050501.22091282873135523.JavaMail.jira@thor>
Date: Thu, 26 Aug 2010 21:38:55 -0400 (EDT)
From: "Stuart Smith (JIRA)" 
To: hdfs-issues@hadoop.apache.org
Subject: [jira] Commented: (HDFS-1169) Can't read binary data off HDFS via thrift API
In-Reply-To: <11370035.15351274435117314.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


    [ https://issues.apache.org/jira/browse/HDFS-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903177#action_12903177 ]

Stuart Smith commented on HDFS-1169:
------------------------------------

Completely hackalicious solution:

Be forewarned: if you know how to rebuild hadoopthriftapi.jar from the gen-java files generated by the thrift IDL file above, you're WAY better off, and please let me know.

Otherwise, go to:

    /hadoop-0.20.2/src/contrib/thriftfs$

Open the file:

    /hadoop-0.20.2/src/contrib/thriftfs$ gvim src/java/org/apache/hadoop/thriftfs/HadoopThriftServer.java

Import the Base64 class from commons-codec:

    import org.apache.commons.codec.binary.Base64;

Note that hadoop & the hadoop thrift APIs depend on commons-codec 1.3, not 1.4. This is unfortunate, because 1.3 has a pretty brain-dead interface.

Modify the write and read functions to treat the string argument (and return value) as base64-encoded binary:

    /**
     * write to a file
     */
    public boolean write(ThriftHandle tout, String encodedData) throws ThriftIOException {
      try {
        now = now();
        HadoopThriftHandler.LOG.debug("write: " + tout.id);
        FSDataOutputStream out = (FSDataOutputStream)lookup(tout.id);
        // The client sends base64 text; decode it back to the raw bytes.
        Base64 base64 = new Base64();
        byte[] tmp = base64.decode(encodedData.getBytes("UTF-8"));
        out.write(tmp, 0, tmp.length);
        HadoopThriftHandler.LOG.debug("wrote: " + tout.id);
        return true;
      } catch (IOException e) {
        throw new ThriftIOException(e.getMessage());
      }
    }

    /**
     * read from a file
     */
    public String read(ThriftHandle tout, long offset, int length) throws ThriftIOException {
      try {
        now = now();
        HadoopThriftHandler.LOG.debug("read: " + tout.id +
                                      " offset: " + offset +
                                      " length: " + length);
        FSDataInputStream in = (FSDataInputStream)lookup(tout.id);
        if (in.getPos() != offset) {
          in.seek(offset);
        }
        byte[] tmp = new byte[length];
        int numbytes = in.read(offset, tmp, 0, length);
        HadoopThriftHandler.LOG.debug("read done: " + tout.id);
        if (numbytes <= 0) {
          return "";  // end of file: nothing to encode
        }
        // Encode only the bytes actually read; the buffer may be partly filled.
        byte[] data = new byte[numbytes];
        System.arraycopy(tmp, 0, data, 0, numbytes);
        // encode(byte[]) throws no checked exception, unlike encode(Object),
        // so there is no EncoderException to handle (and no System.exit in a server).
        Base64 base64 = new Base64();
        return new String(base64.encode(data), "UTF-8");
      } catch (IOException e) {
        throw new ThriftIOException(e.getMessage());
      }
    }

Compile:

    /hadoop-0.20.2/src/contrib/thriftfs$ ant

Copy the jar file:

    hadoop-0.20.2/build/contrib/thriftfs/hadoop-0.20.2-thriftfs.jar

to your namenode (or wherever you run your hadoop thrift server from), and drop it in:

    hadoop-0.20.2/contrib/thriftfs/hadoop-0.20.2-thriftfs.jar

(note: no build dir in that path). Then start your thrift server as normal:

    /hadoop/src/contrib/thriftfs/scripts$ ./start_thrift_server.sh 50050

Now, in all your thrift clients, you have to base64-encode any data before sending it, and decode it after receiving. But you can finally get binary data on hdfs, albeit at a high price in ugliness & performance (since I'm assuming you're storing large files on hdfs...).

I've only tested this on one 224 KB file, but I move everything in 8K chunks client side, so it should work on large files (it'll just be horrifically slow).

Again, if anyone figures out how to rebuild hadoopthriftapi.jar from the gen-java files, please enlighten!

> Can't read binary data off HDFS via thrift API
> ----------------------------------------------
>
>                 Key: HDFS-1169
>                 URL: https://issues.apache.org/jira/browse/HDFS-1169
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: contrib/thriftfs
>    Affects Versions: 0.20.2
>            Reporter: Erik Forsberg
>         Attachments: hadoopfs.thrift, HadoopThriftServer.java
>
>
> Trying to access binary data stored in HDFS (in my case, TypedByte files generated by Dumbo) via thrift talking to org.apache.hadoop.thriftfs.HadoopThriftServer, the data I get back is mangled. For example, when I read a file which contains the value 0xa2, it's coming back as 0xef 0xbf 0xbd, also known as the Unicode replacement character.
> I think this is because the read method in HadoopThriftServer.java is trying to convert the data read from HDFS into UTF-8 via the String() constructor.
> This essentially makes the HDFS thrift API useless for me :-(.
> Not being an expert on Thrift, but would it be possible to modify the API so that it uses the binary type listed on http://wiki.apache.org/thrift/ThriftTypes?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
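
For anyone who wants to see the client-side half of the workaround and the original mangling side by side, here is a minimal standalone sketch. It is not the thrift client itself: the class name, method names and CHUNK_SIZE constant are made up for illustration, and it uses java.util.Base64 (Java 8+) instead of commons-codec 1.3 only so that it runs without extra jars. It splits a byte array into 8K chunks, base64-encodes each chunk as it would be passed to write(), decodes them again as after read(), and also shows what a plain bytes-to-UTF-8-String conversion does to 0xa2.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

class Base64ChunkDemo {
    // 8K raw bytes per chunk, as in the comment; the constant name is hypothetical.
    static final int CHUNK_SIZE = 8 * 1024;

    /** Splits raw bytes into chunks and base64-encodes each chunk on its own. */
    static List<String> encodeChunks(byte[] data, int chunkSize) {
        List<String> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            int end = Math.min(off + chunkSize, data.length);
            byte[] slice = Arrays.copyOfRange(data, off, end);
            chunks.add(Base64.getEncoder().encodeToString(slice));
        }
        return chunks;
    }

    /** Decodes each chunk independently and concatenates the raw bytes. */
    static byte[] decodeChunks(List<String> chunks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String chunk : chunks) {
            byte[] raw = Base64.getDecoder().decode(chunk);
            out.write(raw, 0, raw.length);
        }
        return out.toByteArray();
    }

    /** Mimics the unpatched path: raw bytes -> UTF-8 String -> bytes. */
    static byte[] naiveUtf8RoundTrip(byte[] data) {
        return new String(data, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Test payload containing every possible byte value many times over.
        byte[] data = new byte[20000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        List<String> chunks = encodeChunks(data, CHUNK_SIZE);
        System.out.println("chunks: " + chunks.size());
        System.out.println("base64 round trip intact: "
                + Arrays.equals(data, decodeChunks(chunks)));

        // A lone 0xa2 is not valid UTF-8, so it is replaced by U+FFFD.
        byte[] mangled = naiveUtf8RoundTrip(new byte[] {(byte) 0xa2});
        StringBuilder hex = new StringBuilder();
        for (byte b : mangled) {
            hex.append(String.format("0x%02x ", b & 0xff));
        }
        System.out.println("naive round trip of 0xa2: " + hex.toString().trim());
    }
}
```

Each chunk is encoded and decoded on its own, so the client never has to hold the whole file in memory; the cost is roughly a third more data on the wire, which is part of the performance price mentioned in the comment.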