avro-dev mailing list archives

From "Martin Kleppmann (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AVRO-1783) Gracefully handle strings with wrong character encoding
Date Mon, 11 Jan 2016 22:40:39 GMT

     [ https://issues.apache.org/jira/browse/AVRO-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martin Kleppmann updated AVRO-1783:
    Attachment: AVRO-1783.patch

Attached a patch which I think fixes the broken string handling. It uses {{String#bytesize}}
rather than {{String#size}} to get the length of a string in bytes. For the Avro datatype "string",
it ensures that the string is converted to UTF-8 if necessary. For the Avro datatypes "bytes"
and "fixed", it uses the literal byte sequence passed to the encoder and ignores any encoding.
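
The behaviour described above can be sketched roughly as follows. This is only an illustration, not the code from AVRO-1783.patch: the helper names {{string_payload}} and {{bytes_payload}} are hypothetical, and the sample value is made up. The sketch assumes that the payload of a "string" is the UTF-8 bytes of the text, while "bytes" and "fixed" keep whatever bytes they were given.

```ruby
# Hypothetical helpers for illustration; not taken from the attached patch.

# Avro "string": transcode to UTF-8 if necessary, then treat the result as raw bytes.
def string_payload(datum)
  utf8 = datum.encoding == Encoding::UTF_8 ? datum : datum.encode(Encoding::UTF_8)
  utf8.dup.force_encoding(Encoding::BINARY)
end

# Avro "bytes" and "fixed": keep the literal byte sequence and ignore the encoding tag.
def bytes_payload(datum)
  datum.dup.force_encoding(Encoding::BINARY)
end

# Made-up sample: "café" as four ISO-8859-1 bytes (0xE9 is "é" in Latin-1).
latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*").force_encoding(Encoding::ISO_8859_1)

as_string = string_payload(latin1) # transcoded: "é" becomes the two bytes C3 A9
as_bytes  = bytes_payload(latin1)  # untouched: still the original four bytes

puts as_string.bytesize # length in bytes after transcoding
puts as_bytes.bytesize  # length in bytes of the literal input
```

In both cases the length written ahead of the payload would come from {{bytesize}} of the result, so it always matches the number of bytes that follow.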

Also added test cases that check for all of this. Checked that tests pass in MRI 1.9.3, 2.1,
2.2, JRuby 1.7.3 and 1.7.23.

> Gracefully handle strings with wrong character encoding
> -------------------------------------------------------
>                 Key: AVRO-1783
>                 URL: https://issues.apache.org/jira/browse/AVRO-1783
>             Project: Avro
>          Issue Type: Bug
>          Components: ruby
>    Affects Versions: 1.7.7
>            Reporter: Martin Kleppmann
>         Attachments: AVRO-1783.patch
> In the [vote thread for Avro 1.8.0-rc2|http://mail-archives.apache.org/mod_mbox/avro-dev/201601.mbox/%3CCAGHyZ6K-oe35%2BOYROK6MSwrHxfPHvjmqhJAfRJL2dzexYw6YSw%40mail.gmail.com%3E],
> [~busbey] noticed that [phunt's avro-rpc-quickstart|https://github.com/phunt/avro-rpc-quickstart]
> fails with the following error:
> {code}
> busbey$ ruby sample_ipc_client.rb avro_user pat Hello_World
> Avro::IO::AvroTypeError: The datum
> "\x89\xA9\xD1\xFF@NUm\xEA\x9A\xFB\xDAx\xF5Zq"
> is not an example of schema
> {"type":"fixed","name":"MD5","namespace":"org.apache.avro.ipc","size":16}
>               write_data at
> /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:543
>             write_record at
> /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:610
>                     each at org/jruby/RubyArray.java:1613
>             write_record at
> /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:609
>               write_data at
> /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:561
>                    write at
> /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:538
>  write_handshake_request at
> /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/ipc.rb:136
>                  request at
> /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/ipc.rb:105
>                  request at
> /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/ipc.rb:117
>                   (root) at sample_ipc_client.rb:49
> {code}
> I tried reproducing the error, and it is quite strange. avro-rpc-quickstart works fine
> for me in Ruby (MRI) 2.2 and 2.1, and in JRuby 1.7.23. However, [~busbey] was using JRuby
> 1.7.3 (as visible from the path names above), and in this particular version of JRuby I was
> able to reproduce the issue.
> It seems that in some circumstances (but not always, bizarrely), JRuby 1.7.3 returns
> a UTF-8 encoded string from {{Digest::MD5.digest}}, rather than a binary-encoded string. {{Schema.validate}}
> checks that the string is suitable for writing as datum for a {{fixed}} type by calling {{#size}}.
> In this case, although the MD5 digest of the schema is a 16-byte string, if you interpret
> it as a UTF-8 encoded string, it consists of only 13 characters (i.e. some sequences are interpreted
> as multibyte characters).
> Rather than trying to divine why JRuby is being weird here, I think this is an opportunity
> to fix Avro's handling of strings to make it robust against unexpected encodings.
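
The mismatch between characters and bytes described in the issue can be reproduced without JRuby 1.7.3. The following is a minimal sketch under stated assumptions: the 16 bytes are made up for illustration (they are not the MD5 digest from the trace), and the two {{valid_fixed_*}} methods are hypothetical stand-ins for a size check, not the actual {{Schema.validate}} code.

```ruby
FIXED_SIZE = 16 # size of the MD5 "fixed" schema shown in the trace

# Made-up 16-byte value: three 2-byte sequences (C3 A9) followed by ten ASCII bytes.
raw = ([0xC3, 0xA9] * 3 + [0x41] * 10).pack("C*") # pack("C*") yields a binary string

# The same bytes, but tagged as UTF-8, as if a digest came back with the wrong encoding.
mislabeled = raw.dup.force_encoding(Encoding::UTF_8)

# Hypothetical check that counts characters; it depends on the encoding tag.
def valid_fixed_by_size?(datum, size)
  datum.size == size
end

# Hypothetical check that counts bytes; it does not depend on the encoding tag.
def valid_fixed_by_bytesize?(datum, size)
  datum.bytesize == size
end

puts raw.size                                         # characters in the binary string
puts mislabeled.size                                  # characters when read as UTF-8
puts valid_fixed_by_size?(mislabeled, FIXED_SIZE)     # character-based check
puts valid_fixed_by_bytesize?(mislabeled, FIXED_SIZE) # byte-based check
```

With these bytes the UTF-8 view has 13 characters while the byte count stays at 16, so only the byte-based check accepts the value for a 16-byte {{fixed}} schema.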
