thrift-dev mailing list archives

From "Nathan Beyer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" fields are re-encoded
Date Mon, 12 Nov 2012 21:15:13 GMT

    [ https://issues.apache.org/jira/browse/THRIFT-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495645#comment-13495645 ]

Nathan Beyer commented on THRIFT-1727:
--------------------------------------

I believe the core issue is that there is no distinct 'binary' base type. According to the
Thrift Types (http://thrift.apache.org/docs/types/) document, there is only a 'string' base
type, and 'binary' is a special type that is a specialized form of 'string'.

I'm not sure how this manifests in other languages, but in Ruby, when an IDL has a 'binary'
type, the compiler adds some metadata to the field definitions. Here's an example:
{code}
# IDL with a struct that has string and binary types
struct Combo {
  1: string sdata
  2: binary bdata
}

# Generated Ruby code
    class Combo
      include ::Thrift::Struct, ::Thrift::Struct_Union
      SDATA = 1
      BDATA = 2

      FIELDS = {
        SDATA => {:type => ::Thrift::Types::STRING, :name => 'sdata'},
        BDATA => {:type => ::Thrift::Types::STRING, :name => 'bdata', :binary => true}
      }

      def struct_fields; FIELDS; end

      def validate
      end

      ::Thrift::Struct.generate_accessors self
    end
{code}

Unfortunately, this field information is not available in the protocol classes when serializing
and deserializing. Since 'binary' is not a base type, there is no 'write_binary' or 'read_binary'.
As such, all that's invoked is 'write_string' or 'read_string' and these methods don't seem
to have enough context to get that field definition data. Please let me know if there is access
to this information, as it could be used to skip transcoding for binary fields and instead
force the encoding to BINARY.
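
To make the idea concrete, here is a minimal sketch of what I have in mind. It is not the actual library code: `encode_field_value` and `EXAMPLE_FIELDS` are hypothetical names, and only the `:binary => true` flag is taken from the generated field definitions above. It assumes Ruby 1.9 string encodings and does not need the thrift gem.

```ruby
# Hypothetical helper, not part of the Thrift library: chooses the byte
# representation of a string-typed field from its field definition.
def encode_field_value(field_info, value)
  if field_info[:binary]
    # Binary field: relabel the string as raw bytes, leaving the bytes as-is.
    value.dup.force_encoding(Encoding::BINARY)
  else
    # Text field: transcode to UTF-8, then treat the result as raw bytes.
    value.encode(Encoding::UTF_8).force_encoding(Encoding::BINARY)
  end
end

# Field definitions in the shape of the generated code above
# (type constants omitted so the sketch runs without the thrift gem).
EXAMPLE_FIELDS = {
  1 => {:name => 'sdata'},
  2 => {:name => 'bdata', :binary => true}
}

raw  = [0x00, 0x80, 0xFF].pack('C*')  # ASCII-8BIT string, 3 bytes
text = "caf\u00E9"                    # UTF-8 string, 4 characters

bdata = encode_field_value(EXAMPLE_FIELDS[2], raw)
sdata = encode_field_value(EXAMPLE_FIELDS[1], text)

puts bdata.bytes.to_a.inspect  # the three original bytes, unchanged
puts sdata.bytesize            # 5: the accented character takes two bytes in UTF-8
```

The point of the sketch is only that the decision needs the field definition; with just 'write_string' available, the protocol cannot tell the two cases apart.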

How are the other libraries dealing with this special 'binary' type?
                
> Ruby-1.9: data loss: "binary" fields are re-encoded
> ---------------------------------------------------
>
>                 Key: THRIFT-1727
>                 URL: https://issues.apache.org/jira/browse/THRIFT-1727
>             Project: Thrift
>          Issue Type: Bug
>          Components: Ruby - Library
>    Affects Versions: 0.9
>         Environment: JRuby 1.6.8 using "--1.9" command line parameter.
>            Reporter: XB
>
> When setting a binary field of a Thrift object with some binary data (e.g. a string whose
> encoding is "ASCII-8BIT") and then serializing this object, the binary data is re-encoded.
> That is, it is encoded as if it were not a sequence of bytes but a sequence of characters,
> encoded using the ISO-8859-1 encoding. This assumed ISO-8859-1 sequence of characters is then
> converted into UTF-8 (by BinaryProtocol or CompactProtocol). This basically means that all
> bytes whose values are between 0x80 (inclusive) and 0x100 (exclusive) are converted into
> multi-byte sequences. This leads to data corruption.
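
The re-encoding described in the report can be reproduced in plain Ruby 1.9 without Thrift. The following is a minimal sketch that assumes the transcoding is equivalent to treating the bytes as ISO-8859-1 characters and converting them to UTF-8, as the report states; the variable names are illustrative only.

```ruby
raw = [0x41, 0x80, 0xFF].pack('C*')  # ASCII-8BIT string, 3 bytes

# The behaviour in the report: bytes are read as ISO-8859-1 characters
# and then converted to UTF-8, so every byte >= 0x80 becomes two bytes.
corrupted = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

# Relabeling only: the encoding tag changes, the bytes do not.
preserved = raw.dup.force_encoding(Encoding::BINARY)

puts raw.bytesize                  # 3
puts corrupted.bytesize            # 5
puts corrupted.bytes.to_a.inspect  # [65, 194, 128, 195, 191]
puts preserved.bytes.to_a == raw.bytes.to_a  # true
```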

