Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Sat, 25 Jul 2015 20:07:04 +0000 (UTC)
From: "eugeny birukov (JIRA)" <jira@apache.org>
To: issues@hive.apache.org
Message-ID: <JIRA.12849109.1437844869000.290916.1437854824526@Atlassian.JIRA>
In-Reply-To: <JIRA.12849109.1437844869000@Atlassian.JIRA>
References: <JIRA.12849109.1437844869000@Atlassian.JIRA>
 <JIRA.12849109.1437844869234@arcas>
Subject: [jira] [Updated] (HIVE-11373) Incorrect  (de)serialization STRING
 field to MAP<STRING,STRING> in TRANSFORM operation
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


     [ https://issues.apache.org/jira/browse/HIVE-11373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

eugeny birukov updated HIVE-11373:
----------------------------------
    Description: 
I am trying to transform a JSON string into a MAP<STRING,STRING> using this Python code:

import re
import sys

for d in sys.stdin:
    r = re.sub('[:,]', '\003', re.sub('[{}\"]', '', d))
    print(r.strip())

echo '{"key1":"valu1","key2":"value2"}' | ./json2map.py
key1\003valu1\003key2\003value2
(\003 stands for the unprintable 0x03 byte in the output)

This string should be converted to the Hive type MAP<STRING,STRING>,
but the result of the transformation is displayed as {"key1":"valu1\u0003key2\u0003value2"}

With a single key-value entry it works correctly:

hive> SELECT TRANSFORM ('{"key1":"valu1"}') USING 's3://webgames-emr/hive/restore/json2map.py' AS (parsedjson MAP<STRING, STRING>) FROM json;
...
{"key1":"valu1"}
Time taken: 35.177 seconds, Fetched: 1 row(s)

With multiple key-value entries it works incorrectly:

hive> SELECT TRANSFORM ('{"key1":"valu1","key2":"value2"}') USING 's3://webgames-emr/hive/restore/json2map.py' AS (parsedjson MAP<STRING, STRING>) FROM json;
...
{"key1":"valu1\u0003key2\u0003value2"}
Time taken: 33.486 seconds, Fetched: 1 row(s)

Full steps to reproduce:

echo '{"key1":"valu1","key2":"value2"}' > /tmp/json.txt;

hive -e "CREATE TABLE json(jsonStr STRING); load data local inpath '/tmp/json.txt' overwrite into table json;"

hive -e "SELECT TRANSFORM (jsonStr) USING 's3://webgames-emr/hive/restore/json2map.py' AS (parsedjson MAP<STRING, STRING>) FROM json;"

converting to local s3://webgames-emr/hive/restore/json2map.py
Added resources: [s3://webgames-emr/hive/restore/json2map.py]
Query ID = hadoop_20150725150000_46c48f7d-92c6-41d7-9c54-a90d5b351722
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1437833808701_0006, Tracking URL = http://ip-172-31-11-47.ec2.internal:20888/proxy/application_1437833808701_0006/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1437833808701_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-07-25 15:01:16,773 Stage-1 map = 0%,  reduce = 0%
2015-07-25 15:01:34,319 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.96 sec
MapReduce Total cumulative CPU time: 1 seconds 960 msec
Ended Job = job_1437833808701_0006
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.96 sec   HDFS Read: 261 HDFS Write: 25 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 960 msec
OK
{"key1":"valu1\u0003key2\u0003value2"}
Time taken: 48.878 seconds, Fetched: 1 row(s)

Expected Result {"key1":"valu1","key2":"value2"}

Actual Result {"key1":"valu1\u0003key2\u0003value2"}
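
One possible reading of the expected and actual results above, offered as an assumption and not a confirmed diagnosis: Hive's default text SerDe (LazySimpleSerDe) separates map entries with \002 and keys from values with \003, so a line that uses \003 for both would parse as a single entry whose value contains the rest of the line. The sketch below only simulates that delimiter layout in Python and does not call Hive. The names json_to_map_line and parse_map_line are hypothetical and do not come from this report.

```python
import json

# Assumed LazySimpleSerDe defaults for a top-level MAP column.
# This is an assumption; the report does not confirm which delimiters apply.
ENTRY_SEP = "\x02"  # separates map entries (collection items)
KV_SEP = "\x03"     # separates a key from its value (map keys)


def json_to_map_line(line):
    # Serialize a flat JSON object as key<KV_SEP>value pairs joined by ENTRY_SEP.
    obj = json.loads(line)
    return ENTRY_SEP.join(
        "{}{}{}".format(key, KV_SEP, value) for key, value in obj.items()
    )


def parse_map_line(line):
    # Simulate the assumed parse: split entries on ENTRY_SEP,
    # then split each entry at its first KV_SEP only.
    result = {}
    for entry in line.split(ENTRY_SEP):
        key, _, value = entry.partition(KV_SEP)
        result[key] = value
    return result
```

Under these assumptions, parse_map_line applied to the output of the script in this report (\003 everywhere) yields one key, key1, whose value still contains the \003 bytes, which matches the actual result. Applied to the output of json_to_map_line, it yields the two expected entries.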


> Incorrect  (de)serialization STRING field to MAP<STRING,STRING> in TRANSFORM operation
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-11373
>                 URL: https://issues.apache.org/jira/browse/HIVE-11373
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.13.1, 1.0.0
>         Environment: Amazon EMR (AMI 3.8 with HIVE 0.13.1, emr-4.0.0 with HIVE 1.0)
>            Reporter: eugeny birukov
>


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)