impala-issues mailing list archives

From Branislav Lukáč (JIRA) <j...@apache.org>
Subject [jira] [Created] (IMPALA-5675) Wrong results when querying tables with CHAR/VARCHAR datatypes
Date Tue, 18 Jul 2017 09:24:00 GMT
Branislav Lukáč created IMPALA-5675:
---------------------------------------

             Summary: Wrong results when querying tables with CHAR/VARCHAR datatypes
                 Key: IMPALA-5675
                 URL: https://issues.apache.org/jira/browse/IMPALA-5675
             Project: IMPALA
          Issue Type: Bug
    Affects Versions: Impala 2.7.0
         Environment: Cloudera distro 5.10.1
            Reporter: Branislav Lukáč
         Attachments: Hive_query.png, Impala_query.png

We created an external table with the following statement:

CREATE EXTERNAL TABLE IF NOT EXISTS SAPNSQ.ZAP_GL_EX_IM_CSV (
  GLREQUEST DECIMAL(30),
  KNUMC STRING,
  FACCP STRING,
  FCHAR VARCHAR(20),
  FCLNT VARCHAR(3),
  FCUKY STRING,
  FCURR DOUBLE,
  FDATS STRING,
  FDEC DECIMAL(8, 2),
  FFLTP FLOAT,
  FINT1 TINYINT,
  FINT2 SMALLINT,
  FINT4 BIGINT,
  FLANG STRING,
  FPREC DOUBLE,
  FQUAN DOUBLE,
  FTIMS STRING,
  FUNIT STRING,
  FSSTRING STRING,
  FCHAR40 VARCHAR(40)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS TEXTFILE
LOCATION "hdfs:///user/nsqhdp/H_CDC_IMPQ/ZAP_GL_EX_IM_CSV/63E55F5943E95122E1000000C0A83051"
 
The CSV files are already present at the specified location: hdfs:///user/nsqhdp/H_CDC_IMPQ/ZAP_GL_EX_IM_CSV/63E55F5943E95122E1000000C0A83051
 
When we execute SELECT fchar40 FROM sapnsq.zap_gl_ex_im_csv ORDER BY fchar40 in both Hive
and Impala, we get different results:
- Hive (see Hive_query.png)
- Impala (see Impala_query.png)

It seems that the Impala engine truncates strings when they contain non-ASCII characters.
If a character is encoded with 2 bytes, Impala counts it as 2 characters (instead of 1).
As a result, the FCHAR40 VARCHAR(40) column actually returns fewer than 40 characters.
 
Example:
The 1st row contains 3 special characters: É, Ï and ü (each takes 2 bytes in UTF-8).
The SELECT in Impala truncates the result by 3 characters.
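
The arithmetic behind this example can be reproduced outside Impala. The following is only a minimal Python sketch of the suspected behaviour, assuming the VARCHAR(40) limit is applied to UTF-8 bytes rather than to characters. It is not Impala's actual code. The helper names and the sample value are hypothetical, chosen to match the 1st row described above (three 2-byte characters in a 40-character string).

```python
def truncate_by_bytes(value: str, limit: int) -> str:
    """Cut the UTF-8 encoding of `value` at `limit` bytes (the suspected Impala behaviour).

    A multi-byte character split by the cut is dropped rather than emitted half-way.
    """
    raw = value.encode("utf-8")[:limit]
    return raw.decode("utf-8", errors="ignore")


def truncate_by_chars(value: str, limit: int) -> str:
    """Cut `value` at `limit` characters (code points), the behaviour we expected."""
    return value[:limit]


# Hypothetical 40-character value: É, Ï, ü (2 bytes each in UTF-8) plus 37 ASCII letters.
sample = "\u00c9\u00cf\u00fc" + "a" * 37

chars_in_sample = len(sample)                      # length in characters
bytes_in_sample = len(sample.encode("utf-8"))      # length in UTF-8 bytes
byte_limited = truncate_by_bytes(sample, 40)       # limit counted in bytes
char_limited = truncate_by_chars(sample, 40)       # limit counted in characters

print(chars_in_sample, bytes_in_sample, len(byte_limited), len(char_limited))
```

Under that assumption the value occupies 43 bytes, so a 40-byte limit removes the last 3 characters, which matches the truncation we see in Impala_query.png.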

According to the Impala documentation (https://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_varchar.html),
Unicode should be supported:
"All data in CHAR and VARCHAR columns must be in a character encoding that is compatible with
UTF-8"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
