Date: Mon, 11 Jan 2016 22:40:39 +0000 (UTC)
From: "Martin Kleppmann (JIRA)"
To: dev@avro.apache.org
Subject: [jira] [Updated] (AVRO-1783) Gracefully handle strings with wrong character encoding

     [ https://issues.apache.org/jira/browse/AVRO-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martin Kleppmann updated AVRO-1783:
-----------------------------------
    Attachment: AVRO-1783.patch

Attached a patch which I think fixes the broken string handling. It uses {{String#bytesize}} rather than {{String#size}} to get the length of a string in bytes. For Avro datatype "string", it ensures that the string is converted to UTF-8 if necessary.
For Avro datatypes "bytes" and "fixed", it uses the literal byte sequence passed to the encoder, and ignores any encoding. Also added test cases that check for all of this. Checked that tests pass in MRI 1.9.3, 2.1, 2.2, JRuby 1.7.3 and 1.7.23.

> Gracefully handle strings with wrong character encoding
> -------------------------------------------------------
>
>                 Key: AVRO-1783
>                 URL: https://issues.apache.org/jira/browse/AVRO-1783
>             Project: Avro
>          Issue Type: Bug
>          Components: ruby
>    Affects Versions: 1.7.7
>            Reporter: Martin Kleppmann
>         Attachments: AVRO-1783.patch
>
>
> In the [vote thread for Avro 1.8.0-rc2|http://mail-archives.apache.org/mod_mbox/avro-dev/201601.mbox/%3CCAGHyZ6K-oe35%2BOYROK6MSwrHxfPHvjmqhJAfRJL2dzexYw6YSw%40mail.gmail.com%3E], [~busbey] noticed that [phunt's avro-rpc-quickstart|https://github.com/phunt/avro-rpc-quickstart] fails:
> {code}
> busbey$ ruby sample_ipc_client.rb avro_user pat Hello_World
> Avro::IO::AvroTypeError: The datum "\x89\xA9\xD1\xFF@NUm\xEA\x9A\xFB\xDAx\xF5Zq" is not an example of schema {"type":"fixed","name":"MD5","namespace":"org.apache.avro.ipc","size":16}
>                write_data at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:543
>              write_record at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:610
>                      each at org/jruby/RubyArray.java:1613
>              write_record at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:609
>                write_data at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:561
>                     write at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:538
>   write_handshake_request at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/ipc.rb:136
>                   request at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/ipc.rb:105
>                   request at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/ipc.rb:117
>                    (root) at sample_ipc_client.rb:49
> {code}
> I tried reproducing the error, and it is quite strange.
> avro-rpc-quickstart works fine for me in Ruby (MRI) 2.2 and 2.1, and in JRuby 1.7.23. However, [~busbey] was using JRuby 1.7.3 (as visible from the path names above), and in this particular version of JRuby I was able to reproduce the issue.
> It seems that in some circumstances (but not always, bizarrely), JRuby 1.7.3 returns a UTF-8 encoded string from {{Digest::MD5.digest}}, rather than a binary-encoded string. {{Schema.validate}} checks that the string is suitable for writing as datum for a {{fixed}} type by calling {{#size}}. In this case, although the MD5 digest of the schema is a 16-byte string, if you interpret it as a UTF-8 encoded string, it consists of only 13 characters (i.e. some sequences are interpreted as multibyte characters).
> Rather than trying to divine why JRuby is being weird here, I think this is an opportunity to fix Avro's handling of strings to make it robust against unexpected encodings.
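
For readers who want to see the {{#size}} versus {{#bytesize}} difference in isolation, here is a rough sketch. It is not taken from the attached patch: the 16-byte payload is made up (it is not the digest from the stack trace), and the helpers {{fixed_datum_ok?}} and {{to_utf8}} are hypothetical names used only for illustration, not part of Avro's Ruby API.

```ruby
# Illustrative sketch only; helper names are hypothetical, not Avro's Ruby API.

# A made-up 16-byte payload: three 2-byte UTF-8 sequences plus ten ASCII bytes.
payload = "\xC3\xA9" * 3 + "abcdefghij"

# The same bytes under two different encoding tags.
as_binary = payload.dup.force_encoding(Encoding::BINARY)
as_utf8   = payload.dup.force_encoding(Encoding::UTF_8)

puts as_binary.size      # 16: every byte counts as one character
puts as_utf8.size        # 13: each 2-byte sequence counts as one character
puts as_utf8.bytesize    # 16: bytesize ignores the encoding tag

# Hypothetical check for a "fixed" datum: compare the byte length,
# so the result does not depend on how the string happens to be tagged.
def fixed_datum_ok?(datum, expected_size)
  datum.is_a?(String) && datum.bytesize == expected_size
end

# Hypothetical normalization for a "string" datum: transcode to UTF-8 if needed.
def to_utf8(datum)
  datum.encoding == Encoding::UTF_8 ? datum : datum.encode(Encoding::UTF_8)
end

puts fixed_datum_ok?(as_utf8, 16)   # true, even though as_utf8.size is 13

# A Latin-1 tagged string: the single byte 0xE9 becomes two bytes in UTF-8.
latin1 = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)
puts to_utf8(latin1).bytesize       # 5
```

A length check based on {{#size}} would reject the UTF-8 tagged payload above as a 16-byte {{fixed}} value, which matches the failure mode described in the issue. Note that {{to_utf8}} as sketched would raise for a binary-tagged string containing bytes above 0x7F; how such input should be treated is left open here.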