hive-issues mailing list archives

From "xiaowei wang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-10983) SerDeUtils bug ,when Text is reused
Date Fri, 26 Jun 2015 11:54:04 GMT

     [ https://issues.apache.org/jira/browse/HIVE-10983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xiaowei wang updated HIVE-10983:
--------------------------------
    Description: 
{noformat}
The methods transformTextToUTF8 and transformTextFromUTF8 have a bug: they call
Text.getBytes(), which is the wrong method to use here.
Text.getBytes() returns the raw backing array; however, only the data up to Text.getLength() is
valid. copyBytes() is a better choice if you need the returned array to be exactly the
length of the data.
However, copyBytes() was only added after Hadoop 1.
{noformat}
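
To make the getBytes() behaviour concrete, here is a minimal Java sketch. ReusableText is a hypothetical stand-in written for this illustration, not the real org.apache.hadoop.io.Text; it assumes only what the description above states, namely a backing array that can be longer than the valid data. TextReuseDemo and its method names are likewise illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for org.apache.hadoop.io.Text: a reusable byte buffer
// plus a separate count of valid bytes. This is NOT the real Hadoop class.
final class ReusableText {
    private byte[] bytes = new byte[0];
    private int length = 0;

    // Overwrite the contents, reusing the backing array when it is large enough.
    void set(byte[] src) {
        if (bytes.length < src.length) {
            bytes = new byte[src.length];
        }
        System.arraycopy(src, 0, bytes, 0, src.length);
        length = src.length;
    }

    // Whole backing array; may still hold bytes from an earlier, longer value.
    byte[] getBytes() { return bytes; }

    // Number of valid bytes at the start of the backing array.
    int getLength() { return length; }

    // Copy that is exactly as long as the valid data.
    byte[] copyBytes() { return Arrays.copyOf(bytes, length); }
}

final class TextReuseDemo {
    // Problematic pattern: decodes the entire backing array.
    static String decodeWhole(ReusableText t) {
        return new String(t.getBytes(), StandardCharsets.UTF_8);
    }

    // Bounded pattern: decodes only the valid prefix.
    static String decodeValid(ReusableText t) {
        return new String(t.getBytes(), 0, t.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableText t = new ReusableText();
        t.set("first row, long".getBytes(StandardCharsets.UTF_8));
        t.set("second".getBytes(StandardCharsets.UTF_8)); // shorter value, same buffer
        System.out.println(decodeWhole(t)); // second value plus stale tail of the first
        System.out.println(decodeValid(t)); // second value only
    }
}
```

With the two set() calls in main, decoding the whole array yields the second value followed by the leftover tail of the first, which resembles the corrupted row in the query result.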

When I query data from an LZO table, the results are wrong: the current row
is always longer than the previous row, and sometimes the current row contains content
from the previous row. For example, I run this query:
{code:sql}
select * from web_searchhub where logdate=2015061003
{code}
The result is shown below. Note that the second row contains content from the first row.
{noformat}
INFO [03:00:05.589] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42098,session=3151,thread=254
2015061003
INFO [03:00:05.594] <18941e66-9962-44ad-81bc-3519f47ba274> session=901,thread=223ession=3151,thread=254
2015061003
{noformat}

The original LZO file contains just the 2 rows shown below.
{noformat}
INFO [03:00:05.635] <b88e0473-7530-494c-82d8-e2d2ebd2666c_forweb> session=3148,thread=285
INFO [03:00:05.635] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42095,session=3148,thread=285
{noformat}

I think this error is caused by Text reuse, and I have found a solution.
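
The attached patches are not reproduced in this message. As a rough sketch of the kind of change that would avoid the problem, assuming the cause is decoding past the valid length, a charset conversion can be bounded to the valid bytes. BoundedTranscode and toUtf8 are hypothetical names, not Hive's SerDeUtils code, and ISO-8859-1 stands in for the table's GBK encoding so that the example runs on any JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BoundedTranscode {
    // Hypothetical helper, not Hive's actual SerDeUtils code: re-encode only
    // the first validLength bytes of buf from the source charset to UTF-8.
    static byte[] toUtf8(byte[] buf, int validLength, Charset source) {
        String decoded = new String(buf, 0, validLength, source);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The backing array still holds "XYZ" from an earlier, longer record;
        // only the first 3 bytes belong to the current record.
        byte[] buf = {(byte) 0xE9, 't', (byte) 0xE9, 'X', 'Y', 'Z'};
        byte[] utf8 = toUtf8(buf, 3, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length + " UTF-8 bytes after conversion");
    }
}
```

Passing the offset and length to the String constructor avoids the extra array copy that copyBytes() would make, and it does not depend on copyBytes() being available.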

Additionally, the CREATE TABLE statement for the table is:
{code:sql}
CREATE EXTERNAL TABLE `web_searchhub`(
  `line` string)
PARTITIONED BY (
  `logdate` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\\U0000'
WITH SERDEPROPERTIES (
  'serialization.encoding'='GBK')
STORED AS INPUTFORMAT  "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
          OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION
  'viewfs://nsX/user/hive/warehouse/raw.db/web/web_searchhub';
{code}


  was:
{noformat}
The method transformTextToUTF8 has a bug: it calls Text.getBytes(), which is the wrong method to use here.
Text.getBytes() returns the raw backing array; however, only the data up to Text.getLength() is
valid. copyBytes() is a better choice if you need the returned array to be exactly the
length of the data.
However, copyBytes() was only added after Hadoop 1.
{noformat}

When I query data from an LZO table, the results are wrong: the current row
is always longer than the previous row, and sometimes the current row contains content
from the previous row. For example, I run this query:
{code:sql}
select * from web_searchhub where logdate=2015061003
{code}
The result is shown below. Note that the second row contains content from the first row.
{noformat}
INFO [03:00:05.589] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42098,session=3151,thread=254
2015061003
INFO [03:00:05.594] <18941e66-9962-44ad-81bc-3519f47ba274> session=901,thread=223ession=3151,thread=254
2015061003
{noformat}

The original LZO file contains just the 2 rows shown below.
{noformat}
INFO [03:00:05.635] <b88e0473-7530-494c-82d8-e2d2ebd2666c_forweb> session=3148,thread=285
INFO [03:00:05.635] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42095,session=3148,thread=285
{noformat}

I think this error is caused by Text reuse, and I have found a solution.

Additionally, the CREATE TABLE statement for the table is:
{code:sql}
CREATE EXTERNAL TABLE `web_searchhub`(
  `line` string)
PARTITIONED BY (
  `logdate` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\\U0000'
WITH SERDEPROPERTIES (
  'serialization.encoding'='GBK')
STORED AS INPUTFORMAT  "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
          OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION
  'viewfs://nsX/user/hive/warehouse/raw.db/web/web_searchhub';
{code}



> SerDeUtils bug  ,when Text is reused 
> -------------------------------------
>
>                 Key: HIVE-10983
>                 URL: https://issues.apache.org/jira/browse/HIVE-10983
>             Project: Hive
>          Issue Type: Bug
>          Components: API, CLI
>    Affects Versions: 0.14.0, 1.0.0, 1.2.0
>         Environment: Hadoop 2.3.0-cdh5.0.0
> Hive 0.14
>            Reporter: xiaowei wang
>            Assignee: xiaowei wang
>              Labels: patch
>             Fix For: 0.14.1, 1.2.0
>
>         Attachments: HIVE-10983.1.patch.txt, HIVE-10983.2.patch.txt, HIVE-10983.3.patch.txt, HIVE-10983.4.patch.txt
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
