Date: Wed, 24 Jun 2015 13:31:04 +0000 (UTC)
From: "xiaowei wang (JIRA)"
To: dev@hive.apache.org
Reply-To: dev@hive.apache.org
Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Subject: [jira] [Created] (HIVE-11095) SerDeUtils another bug ,when Text is reused

xiaowei wang created HIVE-11095:
-----------------------------------

             Summary: SerDeUtils another bug ,when Text is reused
                 Key: HIVE-11095
                 URL: https://issues.apache.org/jira/browse/HIVE-11095
             Project: Hive
          Issue Type: Bug
          Components: API, CLI
    Affects Versions: 1.2.0, 1.0.0, 0.14.0
         Environment: Hadoop 2.3.0-cdh5.0.0
                      Hive 0.14
            Reporter: xiaowei wang
            Assignee: xiaowei wang
            Priority: Critical
             Fix For: 1.2.0

The method transformTextFromUTF8 has a bug.

When I query data from an LZO table, I see in the results that the length of the current row is always larger than that of the previous row, and sometimes the current row contains contents of the previous row. For example, I execute the SQL "select * from web_searchhub where logdate=2015061003"; the result is shown below. Notice that the second row contains content from the first row.

    INFO [03:00:05.589] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42098,session=3151,thread=254 2015061003
    INFO [03:00:05.594] <18941e66-9962-44ad-81bc-3519f47ba274> session=901,thread=223ession=3151,thread=254 2015061003

The content of the original LZO file is shown below; it has just 2 rows.

    INFO [03:00:05.635] session=3148,thread=285
    INFO [03:00:05.635] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42095,session=3148,thread=285

I think this error is caused by Text reuse, and I have found a solution.

Additionally, the table creation SQL is:

    CREATE EXTERNAL TABLE `web_searchhub`(
      `line` string)
    PARTITIONED BY (
      `logdate` string)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ' U0000'
    WITH SERDEPROPERTIES (
      'serialization.encoding'='GBK')
    STORED AS INPUTFORMAT
      "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
    OUTPUTFORMAT
      "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    LOCATION
      'viewfs://nsX/user/hive/warehouse/raw.db/web/web_searchhub';

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
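
The symptom described above (a shorter row followed by leftovers of a longer earlier row) is what typically happens when a reused byte buffer is decoded in full instead of only up to its valid length. The sketch below illustrates that general pitfall. It is not Hive's transformTextFromUTF8 and does not use Hadoop's Text class: ReusableBuffer, decodeWholeArray, and decodeValidRange are hypothetical names made up for this illustration, and the strings are invented sample data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TextReuseSketch {

    /** Hypothetical stand-in for a reusable text buffer; NOT Hadoop's Text class. */
    static final class ReusableBuffer {
        private byte[] bytes = new byte[0];
        private int length = 0;

        /** Overwrites the content, keeping the old backing array when it is large enough. */
        void set(String value) {
            byte[] src = value.getBytes(StandardCharsets.UTF_8);
            if (src.length > bytes.length) {
                bytes = Arrays.copyOf(src, src.length);
            } else {
                // Bytes beyond src.length keep whatever the previous value left there.
                System.arraycopy(src, 0, bytes, 0, src.length);
            }
            length = src.length;
        }

        /** Returns the backing array, which may be longer than the valid content. */
        byte[] getBytes() { return bytes; }

        /** Returns the number of valid bytes at the start of the backing array. */
        int getLength() { return length; }
    }

    /** Buggy pattern: decodes the whole backing array, stale bytes included. */
    static String decodeWholeArray(ReusableBuffer b) {
        return new String(b.getBytes(), StandardCharsets.UTF_8);
    }

    /** Safer pattern: decodes only the first getLength() bytes. */
    static String decodeValidRange(ReusableBuffer b) {
        return new String(b.getBytes(), 0, b.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableBuffer row = new ReusableBuffer();
        row.set("AAAAAAAA"); // first (longer) row
        row.set("BB");       // second (shorter) row reuses the same backing array
        System.out.println("whole array: " + decodeWholeArray(row));
        System.out.println("valid range: " + decodeValidRange(row));
    }
}
```

In this sketch, the whole-array decode returns the second row followed by the tail of the first, while the range-limited decode returns only the second row. Whether the actual fix in Hive takes this form is not stated in the report.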