avro-dev mailing list archives

From "Tophe Vigny (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1206) utf-8 serialisation problems
Date Wed, 21 Nov 2012 14:23:58 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502001#comment-13502001 ]

Tophe Vigny commented on AVRO-1206:
-----------------------------------

Hi Doug,

You are using Ruby 1.8.x (the oldest branch). Try Ruby 1.9.x or later (the official
branch); you can use rvm (Ruby Version Manager) to install multiple Ruby versions.

Tophe@info3:~/work/svn_1/trunk/lang/ruby$ rvm use 1.8.7
Using /home/Tophe/.rvm/gems/ruby-1.8.7-p371
Tophe@info3:~/work/svn_1/trunk/lang/ruby$ rake test
/home/Tophe/work/svn_1/trunk/lang/ruby/Rakefile:19: warning: already initialized constant
VERSION
/home/Tophe/.rvm/rubies/ruby-1.8.7-p371/bin/ruby -I"lib:ext:bin:test" -I"/home/Tophe/.rvm/gems/ruby-1.8.7-p371@global/gems/rake-10.0.2/lib"
"/home/Tophe/.rvm/gems/ruby-1.8.7-p371@global/gems/rake-10.0.2/lib/rake/rake_test_loader.rb"
"test/test_socket_transport.rb" "test/test_io.rb" "test/test_datafile.rb" "test/test_help.rb"
"test/test_protocol.rb" 
Loaded suite /home/Tophe/.rvm/gems/ruby-1.8.7-p371@global/gems/rake-10.0.2/lib/rake/rake_test_loader
Started
................................
Finished in 0.536805 seconds.

32 tests, 710 assertions, 0 failures, 0 errors


Tophe@info3:~/work/svn_1/trunk/lang/ruby$ rvm use 1.9.3
Using /home/Tophe/.rvm/gems/ruby-1.9.3-p327
Tophe@info3:~/work/svn_1/trunk/lang/ruby$ rake test
/home/Tophe/.rvm/rubies/ruby-1.9.3-p327/bin/ruby -I"lib:ext:bin:test" -I"/home/Tophe/.rvm/gems/ruby-1.9.3-p327@global/gems/rake-10.0.2/lib"
"/home/Tophe/.rvm/gems/ruby-1.9.3-p327@global/gems/rake-10.0.2/lib/rake/rake_test_loader.rb"
"test/test_socket_transport.rb" "test/test_io.rb" "test/test_datafile.rb" "test/test_help.rb"
"test/test_protocol.rb" 
Run options: 

# Running tests:

...F............................

Finished tests in 0.212220s, 150.7870 tests/s, 3345.5875 assertions/s.

  1) Failure:
test_utf8(TestDataFile) [/home/Tophe/work/svn_1/trunk/lang/ruby/test/test_datafile.rb:152]:
<"家"> expected but was
<"\xE5\xAE\xB6">.

32 tests, 710 assertions, 1 failures, 0 errors, 0 skips
rake aborted!

Apply this modification:

Index: test/test_datafile.rb
===================================================================
--- test/test_datafile.rb	(revision 1410649)
+++ test/test_datafile.rb	(working copy)
@@ -1,3 +1,4 @@
+# -*- coding: utf-8 -*-
 # Licensed to the Apache Software Foundation (ASF) under one
 # or more contributor license agreements.  See the NOTICE file
 # distributed with this work for additional information
@@ -140,4 +141,17 @@
       assert_equal(block_count+1, dw.block_count)
     end
   end
+  def test_utf8
+    datafile = Avro::DataFile::open('data.avr', 'w', '"string"')
+    datafile << "家"
+    datafile.close
+    
+    datafile = Avro::DataFile.open('data.avr')
+    datafile.each do |s|
+      (rmaj,rmin,rlast) = RUBY_VERSION.split(".").map {|a| a.to_i}
+      if rmaj <2 && rmin < 9
+        assert_equal "家", s
+      else
+        assert_equal "家", s.force_encoding('UTF-8')
+      end
+    end
+    datafile.close
+    end
+  end

Tophe@info3:~/work/svn_1/trunk/lang/ruby$ rake test
/home/Tophe/.rvm/rubies/ruby-1.9.3-p327/bin/ruby -I"lib:ext:bin:test" -I"/home/Tophe/.rvm/gems/ruby-1.9.3-p327@global/gems/rake-10.0.2/lib"
"/home/Tophe/.rvm/gems/ruby-1.9.3-p327@global/gems/rake-10.0.2/lib/rake/rake_test_loader.rb"
"test/test_socket_transport.rb" "test/test_io.rb" "test/test_datafile.rb" "test/test_help.rb"
"test/test_protocol.rb" 
Run options: 

# Running tests:

................................

Finished tests in 0.166176s, 192.5669 tests/s, 4272.5791 assertions/s.

32 tests, 710 assertions, 0 failures, 0 errors, 0 skips

Now change write_bytes to:
      def write_bytes(datum)
        write_long(datum.size)
        @writer.write(datum)
      end
      
      
and run the tests under 1.9.3:
Tophe@info3:~/work/svn_1/trunk/lang/ruby$ rake test
/home/Tophe/.rvm/rubies/ruby-1.9.3-p327/bin/ruby -I"lib:ext:bin:test" -I"/home/Tophe/.rvm/gems/ruby-1.9.3-p327@global/gems/rake-10.0.2/lib"
"/home/Tophe/.rvm/gems/ruby-1.9.3-p327@global/gems/rake-10.0.2/lib/rake/rake_test_loader.rb"
"test/test_socket_transport.rb" "test/test_io.rb" "test/test_datafile.rb" "test/test_help.rb"
"test/test_protocol.rb" 
Run options: 

# Running tests:

...F............................

Finished tests in 0.186894s, 171.2203 tests/s, 3798.9507 assertions/s.

  1) Failure:
test_utf8(TestDataFile) [/home/Tophe/work/svn_1/trunk/lang/ruby/test/test_datafile.rb:156]:
<"家"> expected but was
<"\xE5">.

32 tests, 710 assertions, 1 failures, 0 errors, 0 skips
rake aborted!

The test does not fail under 1.8.7:

Tophe@info3:~/work/svn_1/trunk/lang/ruby$ rvm use 1.8.7
Using /home/Tophe/.rvm/gems/ruby-1.8.7-p371
Tophe@info3:~/work/svn_1/trunk/lang/ruby$ rake test
/home/Tophe/work/svn_1/trunk/lang/ruby/Rakefile:19: warning: already initialized constant
VERSION
/home/Tophe/.rvm/rubies/ruby-1.8.7-p371/bin/ruby -I"lib:ext:bin:test" -I"/home/Tophe/.rvm/gems/ruby-1.8.7-p371@global/gems/rake-10.0.2/lib"
"/home/Tophe/.rvm/gems/ruby-1.8.7-p371@global/gems/rake-10.0.2/lib/rake/rake_test_loader.rb"
"test/test_socket_transport.rb" "test/test_io.rb" "test/test_datafile.rb" "test/test_help.rb"
"test/test_protocol.rb" 
Loaded suite /home/Tophe/.rvm/gems/ruby-1.8.7-p371@global/gems/rake-10.0.2/lib/rake/rake_test_loader
Started
................................
Finished in 0.379195 seconds.

32 tests, 710 assertions, 0 failures, 0 errors

It seems that String#size returns the character count in Ruby 1.9 and later, not the
byte count as it does in Ruby before 1.9.
The patch corrects that and works for all Rubies.
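
To illustrate the difference, here is a minimal sketch (not the Avro encoder itself). The helpers write_prefixed and read_prefixed are hypothetical names made up for this example, and the length is stored as a single byte for brevity. It assumes Ruby 1.9 or later, where String#size counts characters and String#bytesize counts bytes.

```ruby
# -*- coding: utf-8 -*-
require 'stringio'

# Hypothetical stand-in for a length-prefixed write; this is NOT the Avro
# encoder, and the length is packed into one unsigned byte for brevity.
def write_prefixed(io, str, length)
  io.write([length].pack('C'))
  io.write(str)
end

# Reads the one-byte length prefix, then that many bytes.
def read_prefixed(io)
  length = io.read(1).unpack('C').first
  io.read(length)
end

s = "家"  # one character, three bytes in UTF-8

# Prefix computed with String#size: the character count on Ruby >= 1.9.
buggy = StringIO.new(String.new)
write_prefixed(buggy, s, s.size)
buggy.rewind
truncated = read_prefixed(buggy)

# Prefix computed with String#bytesize: the byte count.
fixed = StringIO.new(String.new)
write_prefixed(fixed, s, s.bytesize)
fixed.rewind
complete = read_prefixed(fixed)

puts truncated.bytes.to_a.inspect
puts complete.bytes.to_a.inspect
```

With the character count as the prefix, the reader only gets the first byte back, which matches the "\xE5" seen in the failing 1.9.3 run above. With the byte count, all three bytes come back; the result is still a binary string, which is why the test calls force_encoding('UTF-8') before comparing.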
It can surely work with JRuby too, but yajl would need to be removed; perhaps Ruby's json library can do the job?
Then Avro could be used in JRuby through the avro gem.
Alternatively, yajl could be optional: use it if the require succeeds, and fall back to JSON.load and JSON.dump if it is not present.
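
A rough sketch of what such an optional dependency could look like follows. The names JSON_BACKEND, parse_json and encode_json are hypothetical and are not part of the avro gem; the sketch uses JSON.parse and JSON.generate from the standard json library as the fallback, which is an assumption about how the fallback might be wired rather than a tested change to Avro.

```ruby
# Sketch only: prefer yajl when it is installed, otherwise fall back to the
# json library that ships with Ruby. All names below are hypothetical.
begin
  require 'yajl'
  JSON_BACKEND = :yajl
rescue LoadError
  require 'json'
  JSON_BACKEND = :json
end

# Parse a JSON document with whichever backend was loaded.
def parse_json(text)
  JSON_BACKEND == :yajl ? Yajl::Parser.parse(text) : JSON.parse(text)
end

# Serialize a Ruby object to JSON with whichever backend was loaded.
def encode_json(obj)
  JSON_BACKEND == :yajl ? Yajl::Encoder.encode(obj) : JSON.generate(obj)
end

schema = parse_json('{"type": "string"}')
puts encode_json(schema)
```

Keeping the choice behind two small wrapper methods would mean the rest of the library never refers to yajl directly, so a platform without the C extension (such as JRuby) only needs the fallback branch.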




                
> utf-8 serialisation problems 
> -----------------------------
>
>                 Key: AVRO-1206
>                 URL: https://issues.apache.org/jira/browse/AVRO-1206
>             Project: Avro
>          Issue Type: Bug
>          Components: ruby
>    Affects Versions: 1.7.2
>         Environment: ruby-1.9.3p194, avro gem 1.7.2.
>            Reporter: Tophe Vigny
>         Attachments: AVRO-1206.patch
>
>
> Some serialized UTF-8 characters such as "家" cannot be read back later; Avro fails with:
> /gems/ruby-1.9.3-p194/gems/avro-1.7.2/lib/avro/io.rb:230:in `match_schemas': undefined method `type' for nil:NilClass (NoMethodError)
> 	from /home/Tophe/.rvm/gems/ruby-1.9.3-p194/gems/avro-1.7.2/lib/avro/io.rb:288:in `read_data'
> 	from /home/Tophe/.rvm/gems/ruby-1.9.3-p194/gems/avro-1.7.2/lib/avro/io.rb:384:in `read_union'
> 	from /home/Tophe/.rvm/gems/ruby-1.9.3-p194/gems/avro-1.7.2/lib/avro/io.rb:317:in `read_data'
> 	from /home/Tophe/.rvm/gems/ruby-1.9.3-p194/gems/avro-1.7.2/lib/avro/io.rb:392:in `block in read_record'
> 	from /home/Tophe/.rvm/gems/ruby-1.9.3-p194/gems/avro-1.7.2/lib/avro/io.rb:390:in `each'
> 	from /home/Tophe/.rvm/gems/ruby-1.9.3-p194/gems/avro-1.7.2/lib/avro/io.rb:390:in `read_record'
> 	from /home/Tophe/.rvm/gems/ruby-1.9.3-p194/gems/avro-1.7.2/lib/avro/io.rb:318:in `read_data'
> 	from /home/Tophe/.rvm/gems/ruby-1.9.3-p194/gems/avro-1.7.2/lib/avro/io.rb:283:in `read'
> 	from /home/Tophe/.rvm/gems/ruby-1.9.3-p194/gems/avro-1.7.2/lib/avro/data_file.rb:223:in `block in each'
> 	from /home/Tophe/.rvm/gems/ruby-1.9.3-p194/gems/avro-1.7.2/lib/avro/data_file.rb:211:in `loop'
> 	from /home/Tophe/.rvm/gems/ruby-1.9.3-p194/gems/avro-1.7.2/lib/avro/data_file.rb:211:in `each'
> 	from avr_err_example.rb:42:in `block in <main>'

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
