hive-dev mailing list archives

From "Zoltan Ivanfi (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-14846) Char encoding does not apply to newline chars
Date Tue, 27 Sep 2016 12:59:20 GMT
Zoltan Ivanfi created HIVE-14846:
------------------------------------

             Summary: Char encoding does not apply to newline chars
                 Key: HIVE-14846
                 URL: https://issues.apache.org/jira/browse/HIVE-14846
             Project: Hive
          Issue Type: Bug
    Affects Versions: 1.1.0
            Reporter: Zoltan Ivanfi
            Priority: Minor


I created and populated a table with UTF-16 encoding:

    hive> create external table utf16 (col1 timestamp, col2 string) row format delimited fields terminated by "," location '/tmp/utf16';
    hive> alter table utf16 set serdeproperties ('serialization.encoding'='UTF-16');
    hive> insert into utf16 values('2010-01-01 00:00:00.000', 'hőség');

Then I checked the contents of the file:

    $ hadoop fs -cat /tmp/utf16/000000_0 | hd
    00000000  fe ff 00 32 00 30 00 31  00 30 00 2d 00 30 00 31  |...2.0.1.0.-.0.1|
    00000010  00 2d 00 30 00 34 00 20  00 30 00 30 00 3a 00 30  |.-.0.4. .0.0.:.0|
    00000020  00 30 00 3a 00 30 00 30  00 2c 00 63 00 69 00 70  |.0.:.0.0.,.c.i.p|
    00000030  01 51 0a                                          |.Q.|
    00000033

The newline character is written as the single byte 0a instead of the expected two-byte
UTF-16 sequence 00 0a.
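
The mismatch can be checked outside Hive. The following is a minimal sketch, assuming Python 3; the variable names are illustrative and the bytes are copied from the hexdump above. It shows that a UTF-16 (big-endian) newline is two bytes, and that the file consists of well-formed UTF-16 text followed by one stray byte.

```python
# Bytes copied from the hexdump in this report (BOM, text, then a lone 0a).
dump = bytes.fromhex(
    "fe ff 00 32 00 30 00 31 00 30 00 2d 00 30 00 31"
    "00 2d 00 30 00 34 00 20 00 30 00 30 00 3a 00 30"
    "00 30 00 3a 00 30 00 30 00 2c 00 63 00 69 00 70"
    "01 51 0a"
)

# A newline encoded as UTF-16 big-endian takes two bytes: 00 0a.
expected_newline = "\n".encode("utf-16-be")
print("expected newline bytes:", expected_newline.hex())

# The file has an odd number of bytes, so it cannot be pure UTF-16.
print("file length:", len(dump), "bytes")

# Everything except the last byte decodes cleanly; the BOM fe ff selects big-endian.
text = dump[:-1].decode("utf-16")
trailing = dump[-1:]
print("decoded text:", text)
print("trailing byte:", trailing.hex())
```

The last byte is a single-byte newline, which is what an ASCII or UTF-8 writer would emit, not a UTF-16 one.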

If I do it the other way around, putting correctly encoded UTF-16 files into HDFS and
querying them from Hive, each row in the output ends with an unknown Unicode character:

    hive> select * from utf16;
    2010-01-01 00:00:00	hőség�
    2010-01-02 00:00:00	város�
    2010-01-03 00:00:00	füzet�
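
One possible explanation for the trailing character, which this report does not confirm, is that the reader splits records on the single byte 0a before decoding. The sketch below assumes Python 3 and uses hypothetical names; it is not Hive's actual code. It splits correctly encoded UTF-16 data on 0a and decodes each piece, which leaves a dangling 00 byte at the end of every line.

```python
# Two rows encoded as correct UTF-16 big-endian (no BOM, to keep the sketch short).
rows = ["2010-01-01 00:00:00,hőség", "2010-01-02 00:00:00,város"]
data = "".join(row + "\n" for row in rows).encode("utf-16-be")

# Assumption: records are split on the raw byte 0a, ignoring the encoding.
# Each newline is 00 0a, so the 00 stays attached to the end of the previous line.
raw_lines = [line for line in data.split(b"\x0a") if line]

# Decoding a line with a dangling byte yields U+FFFD (the replacement character).
decoded = [line.decode("utf-16-be", errors="replace") for line in raw_lines]

for line in decoded:
    print(line)
```

Under that assumption every decoded line ends with U+FFFD, which matches the shape of the output above.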
