hadoop-hive-dev mailing list archives

From "Yuntao Jia (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-640) Add LazyBinarySerDe to Hive
Date Fri, 17 Jul 2009 17:31:14 GMT

    [ https://issues.apache.org/jira/browse/HIVE-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732621#action_12732621 ]

Yuntao Jia commented on HIVE-640:

Here is my initial proposal for the Lazy Binary SerDe. It should have the following properties:
1/ Lazy: the actual fields are not deserialized until they are accessed, just like LazySimpleSerDe.

2/ Binary: the data is stored in a compact binary format. Unlike BinarySortable, however,
the stored bytes do not preserve the sort order of the original data.
Details on how each data type is stored follow.
2.1/	Null fields in a row. We use a single bit per field to record whether it is null:
0b means null and 1b means not null. Eight bits form a byte; if fewer than eight bits remain
at the end, we still use one full byte (8 bits) for them. These bytes are stored at the beginning
of each row. For example, with a 10-column table, each row begins with two bytes (16 bits).
If the first and the 10th columns of a row are null, we store 01111111b
and 10111111b, which are 127 and 191 in decimal (the six unused bits of the second byte are set to 1).
2.2/	Null fields in container and complex types, such as list, map and struct. Similarly,
we use a single bit to record whether each element is null. For nested data,
such as a list of lists, we store those bytes at each level: some bytes at the beginning
of the outer list indicate whether each of its elements is null, and at the beginning of each
element, which is itself a list, further bytes indicate whether its own elements are null or
not.
2.3/	For elements that are null, we do not store them. 
2.4/	For the int and long primitive types, we store them as variable-length integers,
such as the vint and vlong encodings in Hadoop's WritableUtils.
2.5/	For the other primitive types, including double, float, boolean, byte and short, we store
them in a fixed-size binary format. For example, a boolean takes one byte and a double takes eight bytes.
2.6/	For String, we first store its size as a vint, followed by all the string bytes.
For an empty string, we just store its size.
2.7/	For List, we first store its size as a vint, followed by the bytes representing
whether the elements are null or not. Then the actual elements are stored. For an empty list,
we just store its size.
2.8/	For Map, we first store its size as a vint, followed by the bytes representing
whether the keys and values are null or not. Each key-value pair requires two consecutive
bits, so there are twice as many bits as entries in the map. The key-value pairs are stored
afterwards. For an empty map, we just store its size.
2.9/	For Struct, we first store the bytes representing whether each field is null or not.
Then we store the actual data fields.
3/ We will use the standard writable object inspector.
4/ We will use the BytesWritable class as the serialization class.
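
As a rough illustration of the null-bit layout in 2.1, the Python sketch below packs one bit per field, most significant bit first, and fills the unused trailing bits with 1s. The bit order and the padding value are inferred from the 10-column example rather than stated in the proposal, and the name `null_bitmap` is made up for this sketch.

```python
def null_bitmap(is_null):
    """Pack per-field null flags into bytes: bit 0 = null, bit 1 = not null.

    Assumptions (inferred from the 10-column example, not specified in
    the proposal): the first field maps to the most significant bit of
    the first byte, and unused trailing bits are set to 1.
    """
    out = bytearray()
    for start in range(0, len(is_null), 8):
        chunk = is_null[start:start + 8]
        byte = 0
        for flag in chunk:
            # Shift in one bit per field: 0 for null, 1 for not null.
            byte = (byte << 1) | (0 if flag else 1)
        # Left-align a short final chunk and fill the unused low bits with 1s.
        pad = 8 - len(chunk)
        byte = (byte << pad) | ((1 << pad) - 1)
        out.append(byte)
    return bytes(out)


# 10 columns, with the first and the 10th column null.
row = [True] + [False] * 8 + [True]
print(list(null_bitmap(row)))  # [127, 191]
```

For a map (2.8), the same helper could be fed the key and value flags interleaved, for example `null_bitmap([k1_null, v1_null, k2_null, v2_null])`. Whether the key bit comes before the value bit is not stated in the proposal, so that ordering is also an assumption.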

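The vint encoding in 2.4 and the size-prefixed string layout in 2.6 can be sketched as follows. The `write_vlong` helper is a Python transcription of the variable-length encoding in Hadoop's `WritableUtils.writeVLong` as I understand it, so treat it as illustrative rather than a byte-exact reference. `serialize_string` is a hypothetical name, and UTF-8 is an assumption, since the proposal does not name a string encoding.

```python
def write_vlong(i):
    """Encode an integer as a variable-length byte string.

    Modeled on Hadoop's WritableUtils.writeVLong: values in [-112, 127]
    take a single byte; anything else gets a marker byte (encoding sign
    and length) followed by the value's bytes, most significant first.
    """
    if -112 <= i <= 127:
        return bytes([i & 0xFF])
    marker = -112
    if i < 0:
        i = ~i          # negative values are stored as their one's complement
        marker = -120
    tmp = i
    while tmp != 0:     # one step per byte needed to hold the magnitude
        tmp >>= 8
        marker -= 1
    n = -(marker + 120) if marker < -120 else -(marker + 112)
    out = bytearray([marker & 0xFF])
    for idx in range(n, 0, -1):
        out.append((i >> ((idx - 1) * 8)) & 0xFF)
    return bytes(out)


def serialize_string(s):
    """Size as a vint, then the string bytes (2.6)."""
    data = s.encode("utf-8")  # assumption: the proposal names no encoding
    return write_vlong(len(data)) + data


print(write_vlong(5).hex())            # 05
print(write_vlong(300).hex())          # 8e012c
print(serialize_string("hive").hex())  # 0468697665
print(serialize_string("").hex())      # 00
```

Small values, which are the common case for sizes and counters, cost one byte, and an empty string is just the single size byte 00.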
> Add LazyBinarySerDe to Hive
> ---------------------------
>                 Key: HIVE-640
>                 URL: https://issues.apache.org/jira/browse/HIVE-640
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Yuntao Jia
> LazyBinarySerDe will serialize the data in binary format while supporting LazyDeserialization.
> This will be used as the SerDe for value between map and reduce, and also between different
> map-reduce jobs.
> This will help improve the performance of Hive a lot.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
