hive-dev mailing list archives

From "Edward Capriolo (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2333) LazySimpleSerDe does not properly handle arrays / escape control characters
Date Sat, 06 Jul 2013 14:51:49 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701362#comment-13701362 ]

Edward Capriolo commented on HIVE-2333:
---------------------------------------

{quote}
I'm not sure what the best way to express the contract is, but certainly if a serde does not
support a certain condition, at the very least a warning needs to be shown.
{quote}
I agree. We also need to be clear about what the contract is and draw up a matrix of what
the current serdes do.

LazySimpleSerDe is the standard serde and it has been around for a long time. Unless it
broke recently, we might be better off writing a new serde and making that the default for
new tables. I suggest this because someone is likely dependent on the current behaviour.

{quote}
I propose we should escape the delimiters always, irrespective of whether it is configured
or not.
{quote}
So if you escape the delimiters, does it then abide by the contract?
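
One way to probe that question is with a toy model. The sketch below is not Hive code: the
names ESC, ITEM_SEP, FIELD_SEP, escape, serialize_array and deserialize_array are made up
for illustration, and it only assumes a backslash escape character and the \001 / \002
control characters mentioned in the issue. It suggests that always escaping would let
control characters inside strings round-trip, while the empty-array ambiguity from the
issue description would remain.

```python
# Toy model of "always escape the delimiters". NOT Hive's implementation.
ESC = "\\"          # assumed escape character
ITEM_SEP = "\x02"   # assumed array item separator
FIELD_SEP = "\x01"  # assumed field separator
SPECIAL = {ESC, ITEM_SEP, FIELD_SEP}


def escape(s):
    """Prefix every special character with the escape character."""
    return "".join(ESC + ch if ch in SPECIAL else ch for ch in s)


def serialize_array(items):
    """Join escaped string elements with the item separator."""
    return ITEM_SEP.join(escape(s) for s in items)


def deserialize_array(text):
    """Split on unescaped item separators and undo the escaping."""
    items, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == ESC and i + 1 < len(text):
            current.append(text[i + 1])  # keep the escaped character literally
            i += 2
        elif ch == ITEM_SEP:
            items.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    items.append("".join(current))
    return items


# Control characters inside a string now survive a round trip.
print(deserialize_array(serialize_array(["foo\x02bar"])) == ["foo\x02bar"])

# An empty array and an array holding one empty string still serialize identically.
print(serialize_array([]) == serialize_array([""]))
```

So in this model escaping addresses the control-character half of the report, but some
separate encoding for "empty" versus "no elements" would still be needed.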
                
> LazySimpleSerDe does not properly handle arrays / escape control characters
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-2333
>                 URL: https://issues.apache.org/jira/browse/HIVE-2333
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Jonathan Chang
>            Priority: Critical
>
> LazySimpleSerDe, the default SerDe for Hive, is severely broken:
> * Empty arrays are serialized as an empty string. Hence array(array()) is indistinguishable
> from array(array(array())) and from array().
> * Similarly, empty strings are serialized as an empty string. Hence array('') is also
> indistinguishable from an empty array.
> * If the serialized string equals the null sequence, it is ambiguous whether it is an
> array with a single null element or a null array.
> It also does not do well with control characters:
> > select array('foo\002bar') from tmp;
> ...
> ["foo","bar"]
> > select array('foo\001bar') from tmp;
> ...
> ["foo"]
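
The ambiguities listed in the issue can be reproduced with a small toy model of a
delimiter-joined text format. This is a sketch, not LazySimpleSerDe itself: it assumes
\001 as the field separator, \002, \003, \004 as separators for successive nesting levels,
and \N as the null sequence, and the function names are invented for illustration.

```python
# Toy model of an unescaped, delimiter-joined text format. NOT Hive's implementation.
FIELD_SEP = "\x01"                     # assumed separator between columns
SEPARATORS = ["\x02", "\x03", "\x04"]  # assumed separators for nesting levels 0, 1, 2
NULL_SEQ = "\\N"                       # assumed null sequence


def serialize(value, level=0):
    """Serialize None, a string, or a (nested) list of those, with no escaping."""
    if value is None:
        return NULL_SEQ
    if isinstance(value, list):
        return SEPARATORS[level].join(serialize(v, level + 1) for v in value)
    return value


def read_first_column_as_array(row):
    """Take the first column of a row and split it into array items."""
    column = row.split(FIELD_SEP)[0]
    return column.split(SEPARATORS[0])


# array(), array(array()), array(array(array())) and array('') all become "".
print(serialize([]) == serialize([[]]) == serialize([[[]]]) == serialize([""]) == "")

# A null array and an array with a single null element both become the null sequence.
print(serialize(None) == serialize([None]))

# Unescaped control characters inside a string are read back as delimiters.
print(read_first_column_as_array(serialize(["foo\x02bar"])))  # ['foo', 'bar']
print(read_first_column_as_array(serialize(["foo\x01bar"])))  # ['foo']
```

In this model the last two lines match the query output quoted above: the \002 splits the
string into two items, and everything after the \001 lands in a different column.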

