hive-dev mailing list archives

From "Ted Xu (JIRA)" <>
Subject [jira] Commented: (HIVE-1505) Support non-UTF8 data
Date Tue, 24 Aug 2010 07:05:23 GMT


Ted Xu commented on HIVE-1505:

Thanks Edward.

I dug into the problem and found that the patch does not work when the query has subqueries;
it is very hard to retain encoding information in those queries.

Table properties can be lost in such queries. The problem is the same as a missing field delimiter
setting: whenever Hive can't get the table properties in a subquery (e.g., a join operation), the
default value is used. For the field delimiter the default is ^A, which is why the deserializer
fails most of the time when the data contains a ^A character, even if ^A is not set as the field
delimiter.
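
To illustrate the delimiter case, here is a rough Python sketch (not Hive code). The
split_row helper and the sample row are made up for illustration; "field.delim" is used
as the property key, and the sketch simply assumes the properties dict is empty when the
table properties did not make it into the subquery.

```python
# Illustrative sketch only: split_row is a hypothetical helper, not a Hive API.
DEFAULT_FIELD_DELIM = "\x01"  # ^A, the default field delimiter


def split_row(line, table_props=None):
    """Split a text row on the table's delimiter, or on ^A if no properties are available."""
    props = table_props or {}
    delim = props.get("field.delim", DEFAULT_FIELD_DELIM)
    return line.split(delim)


# A comma-delimited row whose second field happens to contain a ^A character.
row = "id42,payload\x01with-ctrl-a,end"

with_props = split_row(row, {"field.delim": ","})  # table properties available
without_props = split_row(row)                     # properties lost, falls back to ^A

print(with_props)
print(without_props)
```

With the properties present the row splits into three fields; without them it is split
on the stray ^A into two wrong fields. The encoding setting would be lost the same way.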


> Support non-UTF8 data
> ---------------------
>                 Key: HIVE-1505
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers
>    Affects Versions: 0.5.0
>            Reporter: bc Wong
>            Assignee: Ted Xu
>         Attachments: trunk-encoding.patch
> I'd like to work with non-UTF8 data easily.
> Suppose I have data in latin1. Currently, doing a "select *" will return the upper ascii
> characters in '\xef\xbf\xbd', which is the replacement character '\ufffd' encoded in UTF-8.
> Would be nice for Hive to understand different encodings, or to have a concept of byte string.
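
The behaviour described in the issue can be reproduced outside Hive. The following is a
small Python sketch, assuming the reader decodes every byte string as UTF-8 and
substitutes the replacement character for invalid sequences; read_as_utf8 is a
hypothetical stand-in for that reader, not a Hive function.

```python
# Illustrative sketch only: read_as_utf8 is a hypothetical stand-in for a UTF-8-only reader.
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def read_as_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


latin1_bytes = "café".encode("latin-1")   # b'caf\xe9', a single byte for the accented e

garbled = read_as_utf8(latin1_bytes)      # the 0xE9 byte is not valid UTF-8 here
correct = latin1_bytes.decode("latin-1")  # decoding with the real encoding keeps the text

print(garbled.encode("utf-8"))
print(correct)
```

The garbled value ends in U+FFFD, which is '\xef\xbf\xbd' once written back out as UTF-8,
while decoding with the actual encoding recovers the original string.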

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
