hive-dev mailing list archives

From "Phabricator (JIRA)" <>
Subject [jira] [Commented] (HIVE-4199) ORC writer doesn't handle non-UTF8 encoded Text properly
Date Tue, 09 Apr 2013 22:26:16 GMT


Phabricator commented on HIVE-4199:

sxyuan has commented on the revision "HIVE-4199 [jira] ORC writer doesn't handle non-UTF8
encoded Text properly".

  Inline comments.

  ql/src/java/org/apache/hadoop/hive/ql/io/orc/
  I kept the add(String) method because it avoids making two copies when the original
  data is actually a String. If the dictionary only took Text objects, the writer would
  have to convert the String to a new Text object, and set(Text) would then copy the
  bytes over to the dictionary's internal Text object.

  ql/src/java/org/apache/hadoop/hive/ql/io/orc/
  I looked into adding statistics for non-UTF8 strings, but the stats are serialized to
  Protobuf objects, which require strings to be UTF8 encoded. Do you have any suggestions?
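
  To illustrate why non-UTF8 bytes cannot pass through a String unchanged, here is a
  minimal standalone sketch. It uses only the JDK, not the ORC writer or Hadoop's Text;
  the class name NonUtf8RoundTrip and the method roundTrip are hypothetical and exist
  only for this example. It decodes ISO-8859-1 bytes as if they were UTF-8 and then
  re-encodes them, which is roughly what any String-based path ends up doing.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Hive or the patch under review.
public class NonUtf8RoundTrip {

    // Decode the bytes as UTF-8 and encode the resulting String back to UTF-8.
    // Malformed input is silently replaced with U+FFFD during decoding.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "caf\u00e9" in ISO-8859-1 ends with the single byte 0xE9,
        // which is not a valid UTF-8 sequence on its own.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] result = roundTrip(latin1);

        System.out.println("original bytes:   " + Arrays.toString(latin1));
        System.out.println("round-trip bytes: " + Arrays.toString(result));
        System.out.println("identical: " + Arrays.equals(latin1, result));
    }
}
```

  The original bytes are not recoverable after the round trip, so keeping the data as
  bytes end to end is the only way to preserve it.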


To: kevinwilfong, sxyuan
Cc: JIRA, omalley

> ORC writer doesn't handle non-UTF8 encoded Text properly
> --------------------------------------------------------
>                 Key: HIVE-4199
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Samuel Yuan
>            Assignee: Samuel Yuan
>            Priority: Minor
>         Attachments: HIVE-4199.HIVE-4199.HIVE-4199.D9501.1.patch, HIVE-4199.HIVE-4199.HIVE-4199.D9501.2.patch,
HIVE-4199.HIVE-4199.HIVE-4199.D9501.3.patch, HIVE-4199.HIVE-4199.HIVE-4199.D9501.4.patch
> StringTreeWriter currently converts fields stored as Text objects into Strings. This
can lose information (see,
and is also unnecessary since the dictionary stores Text objects.
