hive-issues mailing list archives

From "BELUGA BEHR (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-21240) JSON SerDe Deserialize Re-Write
Date Wed, 13 Feb 2019 13:41:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767202#comment-16767202 ]

BELUGA BEHR edited comment on HIVE-21240 at 2/13/19 1:40 PM:
-------------------------------------------------------------

[~kgyrtkirk] Thank you for the review!!

# I think that it's best to let the JSON library, which is specialized in parsing, do the
work.  It certainly cuts down on code that Hive needs to maintain.  Yes, the tree has to be
built upfront, but as things currently stand, there are very few scenarios where the entire
tree isn't traversed.  I guess that if there are any unknown (extra) fields in the text, those
fields are ignored and their JSON string values are not parsed into data types as things currently
stand.  I don't know that this is a scenario that needs to be optimized for.  I have not seen many
situations where a customer actively wants to ignore a field.
# I think it's best to work with Java {{Collections}} over Java native arrays.  The only thing
that happens with the results of the SerDe is that they are iterated over.  The passing Qtests
bear this out, so I think that 4.0 is a good time to make that change.
# I am sorry about the reformatted lines.  Since I was touching a lot of code in the JsonSerde,
I thought it might be helpful to clean up some check-style issues while I was in there.  I will
revert.  Thank you for pointing me at the Hive formatter; I have been using the Hadoop formatter
for a year+.
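
As a rough illustration of the second point, here is a minimal sketch of why a consumer that only iterates over a row does not care whether the row is a {{List}} or a wrapped native array. The class and method names ({{RowAsList}}, {{deserializeRow}}, {{render}}) are hypothetical and are not taken from the patch or from Hive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RowAsList {

    // Hypothetical stand-in for a deserialized row: one slot per column,
    // pre-filled with nulls so that missing fields simply stay null.
    static List<Object> deserializeRow(int columnCount) {
        List<Object> row = new ArrayList<>(Collections.<Object>nCopies(columnCount, null));
        row.set(0, "alice");
        row.set(1, 42);
        return row;
    }

    // A consumer that only iterates. It accepts any Iterable, so it works
    // the same way for a List-backed row and for a wrapped array.
    static String render(Iterable<?> row) {
        StringBuilder sb = new StringBuilder();
        for (Object value : row) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // List-backed row.
        System.out.println(render(deserializeRow(3)));
        // Array-backed row, viewed as a List.
        System.out.println(render(Arrays.asList(new Object[] {"alice", 42, null})));
    }
}
```

Both calls in {{main}} print the same line, which is the sense in which iteration-only callers are unaffected by the switch.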

Thanks again!


was (Author: belugabehr):
[~kgyrtkirk] Thank you for the review!!

# I think that it's best to let the JSON library, which is specialized in parsing, do the
work.  It certainly cuts down on code that Hive needs to maintain.  Yes, the tree has to be
built upfront, but as things currently stand, there are very few scenarios where the entire
tree isn't traversed.  I guess that if there is an "unknown" field in the text, the field
is ignored and the JSON string value is not parsed into its data type as things currently
stand.  I don't know that this is a scenario that needs to be optimized for.  I have not seen many
situations where a customer actively wants to ignore a field.
# I think it's best to work with Java {{Collections}} over Java native arrays.  The only thing
that happens with the results of the SerDe is that they are iterated over.  The passing Qtests
bear this out, so I think that 4.0 is a good time to make that change.
# I am sorry about the reformatted lines.  Since I was touching a lot of code in the JsonSerde,
I thought it might be helpful to clean up some check-style issues while I was in there.  I will
revert.  Thank you for pointing me at the Hive formatter; I have been using the Hadoop formatter
for a year+.

Thanks again!

> JSON SerDe Deserialize Re-Write
> -------------------------------
>
>                 Key: HIVE-21240
>                 URL: https://issues.apache.org/jira/browse/HIVE-21240
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>    Affects Versions: 4.0.0, 3.1.1
>            Reporter: BELUGA BEHR
>            Assignee: BELUGA BEHR
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>         Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, HIVE-21240.2.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues; I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for each row processed, for each column in the row
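
Some of the items above can be sketched with the Java standard library alone. The following is an illustrative sketch only, not the code in the attached patches: the class name {{ColumnLookupSketch}}, its methods, and the choice of case-insensitive column names are all assumptions. It shows a name-to-index map built once (so each lookup is a hash lookup rather than an O(n) scan), a blank-line row with every column null, and base-64 decoding of a binary value.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ColumnLookupSketch {

    // Built once per table schema, then reused for every row.
    private final Map<String, Integer> indexByName = new HashMap<>();
    private final int columnCount;

    ColumnLookupSketch(List<String> columnNames) {
        this.columnCount = columnNames.size();
        for (int i = 0; i < columnCount; i++) {
            // Assumption: column names are matched case-insensitively.
            indexByName.put(columnNames.get(i).toLowerCase(Locale.ROOT), i);
        }
    }

    // Returns the column position, or -1 for a field that is not in the schema.
    int indexOf(String fieldName) {
        Integer index = indexByName.get(fieldName.toLowerCase(Locale.ROOT));
        return index == null ? -1 : index;
    }

    // A blank input line yields a row in which every column is null.
    List<Object> blankLineRow() {
        return new ArrayList<>(Collections.<Object>nCopies(columnCount, null));
    }

    // Binary columns arrive as base-64 text inside the JSON document.
    static byte[] decodeBinary(String base64Text) {
        return Base64.getDecoder().decode(base64Text);
    }

    public static void main(String[] args) {
        ColumnLookupSketch cols =
            new ColumnLookupSketch(Arrays.asList("id", "name", "payload"));
        System.out.println(cols.indexOf("name"));   // prints 1
        System.out.println(cols.indexOf("extra"));  // prints -1
        System.out.println(cols.blankLineRow());    // prints [null, null, null]
        System.out.println(
            new String(decodeBinary("aGl2ZQ=="), StandardCharsets.UTF_8)); // prints hive
    }
}
```

Returning -1 for an unknown field lets the caller skip it without attempting any type conversion, which matches the "extra fields are ignored" behavior discussed in the comment.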



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
