hive-dev mailing list archives

From "Szehon Ho (JIRA)" <>
Subject [jira] [Commented] (HIVE-3245) UTF encoded data not displayed correctly by Hive driver
Date Mon, 09 Dec 2013 19:38:07 GMT


Szehon Ho commented on HIVE-3245:

[~the6campbells] I thought that initially too, but from my observation there is no code that
specifies encoding in the JDBC driver.

As far as I can tell, in the read case, we need to specify encoding in only two places, neither
of which is in the JDBC driver.
1. when we construct the string from input bytes (done on Hive-Server2)
2. when we attempt to display the string using a Java PrintStream (done in the consuming Java
application, like Beeline)
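
To make the two points concrete, here is a minimal, self-contained sketch in plain Java. It
is not Hive code: the class and method names (EncodingSketch, decode, render) are made up
for illustration, and UTF-8 is assumed as the data encoding. It only shows that the charset
has to be named once when bytes become a String and once when the String is written out.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in Hive.
class EncodingSketch {

    // Point 1: build the String from raw bytes with an explicit charset
    // instead of relying on the platform default.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Point 2: write the String through a PrintStream created with an
    // explicit encoding, and return the bytes that were actually emitted.
    static byte[] render(String value) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(sink, true, "UTF-8");
            out.print(value);
            out.flush();
            return sink.toByteArray();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java runtime must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Three Japanese characters, written as escapes so the source file
        // encoding does not matter.
        byte[] raw = "\u65e5\u672c\u8a9e".getBytes(StandardCharsets.UTF_8);

        String good = decode(raw);
        // Decoding the same bytes with the wrong charset yields a different String.
        String bad = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println("decoded correctly: " + good.equals("\u65e5\u672c\u8a9e"));
        System.out.println("wrong charset differs: " + !bad.equals(good));

        // An application that prints results would wrap its output the same way.
        PrintStream console = new PrintStream(System.out, true, "UTF-8");
        console.println(good);
    }
}
```

If the console itself is not set up for UTF-8, the last line may still look wrong even
though the String is correct, which is the display problem described in point 2.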

The driver receives each column value from Hive-Server2 already in the form of a Thrift string,
and passes it on as is to the application when resultSet.getString() is called.  That
is different from the old JDBC driver, in which there was code (as pointed out by Mark Grover)
that did do some decoding and re-encoding.  That's why I don't see anything that needs to be done
now at the JDBC layer.  Let me know if I am missing something.

My comment about making sure your java application is properly configured addresses the final
display of the string (2nd point above).

> UTF encoded data not displayed correctly by Hive driver
> -------------------------------------------------------
>                 Key: HIVE-3245
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: JDBC
>    Affects Versions: 0.8.0
>            Reporter: N Campbell
>            Assignee: Szehon Ho
>         Attachments: ASF.LICENSE.NOT.GRANTED--screenshot-1.jpg, CERT.TLJA.txt
> Various foreign-language data (e.g. Japanese, Thai) is loaded into string columns
> via tab-delimited text files. A simple projection of the columns in the table is not displaying
> the correct data. Exporting the data from Hive and looking at the files implies the data is
> loaded properly. It appears to be an encoding issue at the driver, but I am unaware of any
> URL connection properties regarding encoding that Hive JDBC requires.
> create table if not exists CERT.TLJA_JP_E ( RNUM int , C1 string, ORD int)
> row format delimited
> fields terminated by '\t'
> stored as textfile;
> create table if not exists CERT.TLJA_JP ( RNUM int , C1 string, ORD int)
> stored as sequencefile;
> load data local inpath '/home/hadoopadmin/jdbc-cert/CERT/CERT.TLJA_JP.txt'
> overwrite into table CERT.TLJA_JP_E;
> insert overwrite table CERT.TLJA_JP  select * from CERT.TLJA_JP_E;

This message was sent by Atlassian JIRA
