Date: Sat, 1 Nov 2014 23:44:33 +0000 (UTC)
From: "Xiaobing Zhou (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Commented] (HIVE-7511) Hive: output is incorrect if there are UTF-8 characters in where clause of a hive select query.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

    [ https://issues.apache.org/jira/browse/HIVE-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193589#comment-14193589 ]

Xiaobing Zhou commented on HIVE-7511:
-------------------------------------

This can be resolved by setting the Java option -Dfile.encoding=UTF-8. Setting it through the environment variable (_JAVA_OPTIONS=-Dfile.encoding=UTF-8) or passing it as a JVM start argument both work fine.

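A possible reason the option helps, offered as an assumption rather than something this thread confirms: if the query text is decoded with the JVM's default charset, the UTF-8 bytes of the pattern character become a different string whenever that default is a single-byte charset. The sketch below is a standalone illustration; the class name EncodingMismatchSketch is hypothetical and not part of Hive, and ISO-8859-1 only stands in for whatever single-byte default a given machine might use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Hive.
final class EncodingMismatchSketch {
    // UTF-8 bytes of U+4E04, the character used in the reported LIKE pattern.
    static final byte[] PATTERN_BYTES = {(byte) 0xE4, (byte) 0xB8, (byte) 0x84};

    // Decodes the pattern bytes with the given charset.
    static String decodePattern(Charset cs) {
        return new String(PATTERN_BYTES, cs);
    }

    public static void main(String[] args) {
        String asUtf8 = decodePattern(StandardCharsets.UTF_8);
        // Assumption: ISO-8859-1 stands in for a single-byte platform default.
        String asSingleByte = decodePattern(StandardCharsets.ISO_8859_1);

        // Shows which charset this JVM actually picked up (affected by -Dfile.encoding).
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 decode length: " + asUtf8.length());             // 1
        System.out.println("single-byte decode length: " + asSingleByte.length()); // 3
        System.out.println("decodes match: " + asUtf8.equals(asSingleByte));       // false
    }
}
```

Running it once without and once with -Dfile.encoding=UTF-8 (or with _JAVA_OPTIONS set as described above) shows whether the default charset line changes on a given JVM.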
> Hive: output is incorrect if there are UTF-8 characters in where clause o= f a hive select query. > -------------------------------------------------------------------------= ---------------------- > > Key: HIVE-7511 > URL: https://issues.apache.org/jira/browse/HIVE-7511 > Project: Hive > Issue Type: Bug > Affects Versions: 0.13.0 > Environment: Windows Server 2008 R2 > Reporter: Xiaobing Zhou > Assignee: Xiaobing Zhou > Priority: Critical > Attachments: HIVE-7511.1.patch > > > When we put UTF-8 characters in where clause of a hive query the results = are empty for "where content like '%=E4=B8=84%'" and results contain all ro= ws for "where content not like '%=E4=B8=84%';" even when few rows contain t= his character. > Steps to reproduce: > 1. Save a file called data.txt in the root container. The contents of the= files are as follows. > 190=09=E4=B8=84f=E9=BD=84=E5=95=8Ac=E7=8B=9B=E4=B6=B4h=E4=B6=B4c=E7=8B=9D > 899=09d=E7=8B=9C=E7=8B=9C=E3=90=81geg=E9=98=BF=E7=8B=9Aea=E4=B6=B4eead=E7= =8B=9Ce > 137=09=E9=BD=84=E9=BC=BEh=E7=8B=9Dge=E3=90=80=E7=8B=9Bg=E7=8B=9A=E9=98=BF > 21=09=EF=A8=A9=EF=A8=A9e=E3=90=80c=E7=8B=9B=E9=BC=BEd=E4=B6=B4=EF=A8=A8 > 767=09=EF=A8=A9c=EF=A8=A9g=E7=8B=9C=E3=90=81=E7=8B=9C=E7=8B=9B=E9=BD=84= =E9=98=BF=EF=A8=A9=E7=8B=9A=E9=BD=84=EF=A8=A8=E4=B6=B5=E7=8B=9D=EF=A8=A8 > 281=09=EF=A8=A8=E3=90=80=E5=95=8Aaga=E5=95=8Ac=E7=8B=9De=E9=BC=BE=E9=BC= =BE > 573=09=E3=90=81=E4=B6=B4hc=EF=A8=A8b=E7=8B=9D=E3=90=81=EF=A8=A9=E4=B6=B4= =E7=8B=9C=E4=B8=84hc=E9=BD=84 > 966=09=E4=B6=B4=E4=B8=84=E7=8B=9C=EF=A8=A8e=E7=8B=9Deb=E7=8B=9C=E3=90=81c= =E3=90=80=E9=BC=BE=EF=A8=A9=E4=B8=84ga=E7=8B=9A=E4=B8=84 > 565=09=E4=B6=B5=E3=90=80=EF=A8=A9=E3=90=80bb=E7=8B=9Behd=E4=B8=84ea=E4=B8= =84=E3=90=80 > 778=09=EF=A8=A9=E3=90=81=E9=98=BF=EF=A8=A8=E7=8B=9Abbea=E4=B8=84=E4=B6=B5= =E4=B8=84=E7=8B=9A=E9=BC=BE=E7=8B=9Aa=E4=B6=B5 > 363=09gd=E9=BD=84a=E9=BC=BEa=E4=B6=B4b=E3=90=81=E3=90=81fg=E9=BC=BE > 822=09a=E9=98=BF=E7=8B=9C=E4=B6=B5h=E4=B6=B5e=E7=8B=9Bh=EF=A8=A9gac=E7=8B= 
=9C=E9=98=BF=E3=90=80=E5=95=8Ab > 338=09b=E9=BD=84=E3=90=81ff=E9=98=BFe=E7=8B=9Ce=E3=90=80ba=E9=BD=84 > 2. Execute the following queries to setup the table. > a. CREATE TABLE hivetable(row INT, content STRING) ROW FORMAT DELIMITED F= IELDS TERMINATED BY ' > t' LOCATION '/hivetable'; > b. LOAD DATA INPATH 'wasb:///data.txt' OVERWRITE INTO TABLE hivetable; > 3. create a query file query.hql with following contents > INSERT OVERWRITE DIRECTORY 'wasb:///hiveoutput' > select * from hivetable where content like '%=E4=B8=84%'; > 4. even though few rows contains this character the output is empty. > 5. change the contents of query.hql to=20 > INSERT OVERWRITE DIRECTORY 'wasb:///hiveoutput' > select * from hivetable where content not like '%=E4=B8=84%'; > 6. The output contains all rows including those containing the given char= acter. > 7. Similar results are observed when using "where content =3D '=E4=B8=84f= =E9=BD=84=E5=95=8Ac=E7=8B=9B=E4=B6=B4h=E4=B6=B4c=E7=8B=9D'; " > 8. We get expected results when using "where content like '%a%'; " -- This message was sent by Atlassian JIRA (v6.3.4#6332)
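For comparison with the reproduction steps in the quoted report, the sketch below emulates the LIKE filter with a plain substring test over three of the sample rows (190, 899 and 363). It is only an illustration: the class ExpectedLikeResult and its methods are hypothetical, String.contains merely approximates '%...%' matching, and the mis-decoded needle assumes the pattern's UTF-8 bytes were read as ISO-8859-1, which the report does not confirm. Unicode escapes are used so the source file's own encoding cannot alter the strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of Hive.
final class ExpectedLikeResult {
    // Three sample rows from data.txt (ids 190, 899, 363), in file order.
    static Map<Integer, String> sampleRows() {
        Map<Integer, String> rows = new LinkedHashMap<>();
        rows.put(190, "\u4E04f\u9F44\u554Ac\u72DB\u4DB4h\u4DB4c\u72DD");
        rows.put(899, "d\u72DC\u72DC\u3401geg\u963F\u72DAea\u4DB4eead\u72DCe");
        rows.put(363, "gd\u9F44a\u9F3Ea\u4DB4b\u3401\u3401fg\u9F3E");
        return rows;
    }

    // Approximates "content LIKE '%needle%'" with a substring test.
    static List<Integer> rowsContaining(String needle) {
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, String> e : sampleRows().entrySet()) {
            if (e.getValue().contains(needle)) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Correctly decoded pattern character U+4E04: only row 190 matches.
        System.out.println(rowsContaining("\u4E04"));             // [190]
        // ASCII pattern, as in step 8 of the report.
        System.out.println(rowsContaining("a"));                  // [899, 363]
        // Assumed mis-decoding of the same bytes (E4 B8 84) as ISO-8859-1.
        System.out.println(rowsContaining("\u00E4\u00B8\u0084")); // []
    }
}
```

Under that assumption the mis-decoded needle matches nothing, which lines up with the empty LIKE output in step 4 and the NOT LIKE output containing every row in step 6.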