hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Krishna Kumar (JIRA)" <>
Subject [jira] [Updated] (HIVE-956) Add support of columnar binary serde
Date Wed, 08 Jun 2011 16:09:59 GMT


Krishna Kumar updated HIVE-956:

    Attachment: HIVE.956.patch.2

Re-ordering my responses;

bq. can LazyBinaryColumnarSerDe share some code with LazyBinarySerde?

	Yes, it does. For instance, the serialization of LazyBinaryColumnarSerde forwards to LazyBinarySerde.serialize
for each field (except for one special case of empty string described below). Similarly the
deserialization happens in both cases eventually via LazyBinaryXXX.init. The objectinspector
is the same, while the common parts of the object class (ColumnarStruct and BinaryColumnarStruct)
has been refactored into a ColumnarStructBase class.
bq. how warnedOnceNullMapKey is used?

	To enable the above (reuse of LazyBinarySerde.serialize()), I have made it a static method
of LazyBinarySerde. The only member variable which was used in that method was a boolean which
was used to issue a LOG warning the first time a null map key is encountered. So, I made that
into a parameter so that the existing behaviour for LazyBinarySerde is unchanged (that is
a warning is issued once per instance).
bq. why do you need to handle empty string specially? "serializeStream.write(INVALID_UTF__SINGLE_BYTE,
0, 1);" i thought for empty data, we just store data length 0 in rcfile. 
I thought since there is no NULLSequence in the new serde, the null should be handled specially.
 i am missing sth, How do you handle null here?	

	As you know, column values' length are stored in the key part of rcfile (after run-length
encoding, and an optional compressing). A 0 in this recorded length is used as the null indicator.
This means that non-null values should occupy one or more bytes when serialized. That was
ok with the original LazyBinarySerde.serialize, as primitive numeric types, strings (with
their datalength) and complex types (with their datalength) do occupy non-zero bytes. But
this is too much redundancy and overhead for the typical case (non-empty strings), so I added
an extra parameter "skipLengthPrefix" which skips prefixing string/list/map/struct types with
a length prefix. But with this, empty strings become a problem since they need to differentiated
from nulls. So I used this special single-byte marker for denoting empty strings. (As a side
note, for completeness' sake, I should point out that an instance of a struct which has no
fields will be encoded with zero bytes. But this is not allowed by the language so I think
we are fine here.)

bq. is getLength() only for null check? if yes, can you call it 'isNull()'? And if the only
difference in ColumnarStruct and BinaryColumnarStruct is null check, just curious, how difficult
is it to avoid this new BinaryColumnarStruct class?

	See above. In general, the length recorded in the key part of rcfile reflects the length
of the bytesequence with which the lazyobject should be initialized. The only exception is
in the case of empty strings, where the recorded length is 1 (the special marker), but the
lazyobject needs be initialized with a 0-length byte sequence. 	
	Recorded length being 0 indicates nulls for lazybinarycolumnar and data being the nullsequence
indicates null for lazybinary. The difference between ColumnarStruct and BinaryColumnarStruct
is this length/null handling, and the object creation itself, which are now the abstract methods
of the common base class.

bq. originally the map comparison is not supported, but this patch added a mapEqualComparer.
can we put this in a separate patch? It seems the logic in CrossMapEqualComparer is not correct.
(how do you make sure you will get the keys from a map in some kind of same order?)

	I put this in this patch since I needed that for the tests that I had added. Do you think
I should create a dependant jira and extract this part of the patch to that jira?
	Hmm, the logic in crossmapequalcomparer looks ok to me (given the caveats mentioned in the
javadoc about broken transitivity of greater-than/less-than.) I am not accessing the keys
in tandem, but in a nested loop. Since the number of keys are the same, and the keys are unique,
both keys and values matching (as declared by is taken as a
match for that pair of key-value pairs.
bq. can you add a 'toString()' for new binary columnar serde, just the same as columnar serde

	Done, and patch regenerated after rebasing.

> Add support of columnar binary serde
> ------------------------------------
>                 Key: HIVE-956
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Krishna Kumar
>         Attachments: HIVE.956.patch.0, HIVE.956.patch.1, HIVE.956.patch.2

This message is automatically generated by JIRA.
For more information on JIRA, see:

View raw message