hadoop-hdfs-issues mailing list archives

From "Stuart Smith (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1169) Can't read binary data off HDFS via thrift API
Date Fri, 27 Aug 2010 01:38:55 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903177#action_12903177 ]

Stuart Smith commented on HDFS-1169:
------------------------------------

Completely hackalicious solution:

Be forewarned: if you know how to rebuild hadoopthriftapi.jar from the gen-java files
generated by the thrift IDL file above, you're WAY better off, and please let me know.
Otherwise, go to:

/hadoop-0.20.2/src/contrib/thriftfs$

Open the file:

/hadoop-0.20.2/src/contrib/thriftfs$ gvim src/java/org/apache/hadoop/thriftfs/HadoopThriftServer.java

Add the commons-codec imports:

import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.EncoderException;

Note that hadoop & the hadoop thrift APIs depend on commons-codec 1.3, not 1.4.
This is unfortunate, because 1.3 has a pretty brain-dead interface: everything works on
byte[], and the String conveniences (encodeBase64String and friends) only arrived in 1.4.
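
For reference, a minimal round trip with the 1.3 helpers looks roughly like this. It's a
standalone sketch, not part of the patch, and the sample bytes are arbitrary:

import java.util.Arrays;
import org.apache.commons.codec.binary.Base64;

public class Codec13RoundTrip {
  public static void main(String[] args) throws Exception {
    byte[] raw = new byte[] { (byte) 0xa2, 0x00, 0x7f };   // arbitrary binary sample

    // commons-codec 1.3 only offers byte[]-based static helpers; the String
    // conveniences (encodeBase64String, decodeBase64(String)) arrived in 1.4.
    String wireSafe = new String(Base64.encodeBase64(raw), "UTF-8");
    byte[] back = Base64.decodeBase64(wireSafe.getBytes("UTF-8"));

    System.out.println(Arrays.equals(raw, back));   // prints "true"
  }
}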

Modify the write and read functions to treat the string argument (and return value) as
base64-encoded binary:

    /**
     * Write to a file. The incoming string is treated as base64-encoded
     * binary data and is decoded before being written to HDFS.
     */
    public boolean write(ThriftHandle tout, String encodedData) throws ThriftIOException {
      try {
        now = now();
        HadoopThriftHandler.LOG.debug("write: " + tout.id);
        FSDataOutputStream out = (FSDataOutputStream)lookup(tout.id);
        Base64 base64 = new Base64();
        byte[] tmp = base64.decode(encodedData.getBytes("UTF-8"));
        out.write(tmp, 0, tmp.length);
        HadoopThriftHandler.LOG.debug("wrote: " + tout.id);
        return true;
      } catch (IOException e) {
        throw new ThriftIOException(e.getMessage());
      }
    }

    /**
     * Read from a file. The bytes read from HDFS are base64-encoded
     * before being returned, so the Thrift string stays valid UTF-8.
     */
    public String read(ThriftHandle tout, long offset,
                       int length) throws ThriftIOException {
      try {
        now = now();
        HadoopThriftHandler.LOG.debug("read: " + tout.id +
                                      " offset: " + offset +
                                      " length: " + length);
        FSDataInputStream in = (FSDataInputStream)lookup(tout.id);
        if (in.getPos() != offset) {
          in.seek(offset);
        }
        byte[] tmp = new byte[length];
        int numbytes = in.read(offset, tmp, 0, length);
        HadoopThriftHandler.LOG.debug("read done: " + tout.id);
        // Only encode the bytes actually read; a short read near EOF would
        // otherwise pad the returned data with trailing zeros.
        if (numbytes < 0) {
          numbytes = 0;
        }
        byte[] data = tmp;
        if (numbytes < length) {
          data = new byte[numbytes];
          System.arraycopy(tmp, 0, data, 0, numbytes);
        }
        try {
          Base64 base64 = new Base64();
          return new String((byte[]) base64.encode((Object) data), "UTF-8");
        } catch (EncoderException e) {
          // Don't kill the server on an encoding failure; report it instead.
          throw new ThriftIOException(e.getMessage());
        }
      } catch (IOException e) {
        throw new ThriftIOException(e.getMessage());
      }
    }

Compile:

/hadoop-0.20.2/src/contrib/thriftfs$ ant

Copy the jar file:

hadoop-0.20.2/build/contrib/thriftfs/hadoop-0.20.2-thriftfs.jar

to your namenode (or wherever you run your hadoop thrift server from), and drop it in:

hadoop-0.20.2/contrib/thriftfs/hadoop-0.20.2-thriftfs.jar

(note: no build dir in the destination path).

Then start your thrift server as normal:

/hadoop/src/contrib/thriftfs/scripts$ ./start_thrift_server.sh 50050


Now, in all your thrift clients, you have to base64-encode any data before sending it and
decode it after receiving.

But you can finally get binary data onto HDFS, albeit at a high price in ugliness &
performance (because I'm assuming you're storing large files on HDFS...).

I've only tested this on one 224 KB file, but I move everything in 8K chunks client side,
so it should work on large files (it'll just be horrifically slow).
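
To make the client-side contract concrete, here's a rough Java sketch of the wrapping.
Only the commons-codec calls are real; the ThriftHadoopFileSystem.Client / ThriftHandle /
Pathname usage in the comments is assumed from the gen-java output of hadoopfs.thrift, so
check the names and signatures against your own generated sources:

import org.apache.commons.codec.binary.Base64;

/**
 * Client-side helpers: base64-encode each chunk before handing it to the
 * thrift write call, and decode what the thrift read call returns.
 */
public class Base64ChunkCodec {
  public static final int CHUNK_SIZE = 8 * 1024;   // 8K chunks, as above

  /** Encode the first len bytes of buf into a Thrift-safe string. */
  public static String encodeChunk(byte[] buf, int len) throws Exception {
    byte[] exact = new byte[len];
    System.arraycopy(buf, 0, exact, 0, len);
    return new String(Base64.encodeBase64(exact), "UTF-8");
  }

  /** Decode a string returned by the thrift read call back into bytes. */
  public static byte[] decodeChunk(String encoded) throws Exception {
    return Base64.decodeBase64(encoded.getBytes("UTF-8"));
  }

  // Intended usage around the (assumed) generated client -- pseudocode only:
  //
  //   ThriftHadoopFileSystem.Client fs = ...;              // connected client
  //   ThriftHandle h = fs.create(new Pathname("/tmp/blob.bin"));
  //   int n;
  //   while ((n = localFile.read(buf)) > 0) {
  //     fs.write(h, encodeChunk(buf, n));                  // encode before sending
  //   }
  //   byte[] data = decodeChunk(fs.read(h, offset, CHUNK_SIZE));  // decode after receiving
  //   fs.close(h);
}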

Again, if anyone figures out how to rebuild hadoopthriftapi.jar from the gen-java files,
please enlighten!




> Can't read binary data off HDFS via thrift API
> ----------------------------------------------
>
>                 Key: HDFS-1169
>                 URL: https://issues.apache.org/jira/browse/HDFS-1169
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: contrib/thriftfs
>    Affects Versions: 0.20.2
>            Reporter: Erik Forsberg
>         Attachments: hadoopfs.thrift, HadoopThriftServer.java
>
>
> Trying to access binary data stored in HDFS (in my case, TypedByte files generated by
> Dumbo) via thrift talking to org.apache.hadoop.thriftfs.HadoopThriftServer, the data I get
> back is mangled. For example, when I read a file which contains the value 0xa2, it's coming
> back as 0xef 0xbf 0xbd, also known as the Unicode replacement character.
> I think this is because the read method in HadoopThriftServer.java is trying to convert
> the data read from HDFS into UTF-8 via the String() constructor.
> This essentially makes the HDFS thrift API useless for me :-(.
> Not being an expert on Thrift, but would it be possible to modify the API so that it
> uses the binary type listed on http://wiki.apache.org/thrift/ThriftTypes?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

