hive-user mailing list archives

From Yongqiang He <>
Subject Re: Converting types from java HashMap, Long and Array to BytesWritable for RCFileOutputFormat
Date Thu, 10 Jun 2010 07:23:46 GMT
Please see inline comments.
Please correct me if I am wrong about the serde layer.

On 6/9/10 11:24 PM, "Viraj Bhat" <> wrote:

> Hi Yongqiang and Hive users,
> In my Map Reduce program I have HashMaps and Arrays of HashMaps, which
> I need to convert to BytesRefWritable for using the RCFileOutputFormat
> (which uses values as BytesRefWritable). I am then planning to re-read
> this data using ROW FORMAT SERDE
> 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'.
> Here are the questions I have about the steps to be followed:
> 1) Should I take the columnarserde code and write my own serde since I
> have HashMaps and Array of HashMaps?

I do not think you need to write your own serde. Hive's serdes support
complex types and nested complex types.
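For example, your "Array of HashMaps" can be described to the serde layer
with the standard object inspectors. A minimal sketch (the variable names
are mine):

  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
  import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

  ObjectInspector strOI =
      PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  // map<string,string>
  ObjectInspector mapOI =
      ObjectInspectorFactory.getStandardMapObjectInspector(strOI, strOI);
  // array<map<string,string>> -- a nested complex type
  ObjectInspector listOI =
      ObjectInspectorFactory.getStandardListObjectInspector(mapOI);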

> 2) Where should I specify the separators I need to use for the HashMaps
> and Array of HashMaps I am creating?
If you are writing out the data and want to use Hive's serde to read it
back, you can just use Hive's default separators (which are defined in
LazySimpleSerDe); see the sketch below.
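For reference, the defaults are Ctrl-A (\001) between fields, Ctrl-B (\002)
between collection items, and Ctrl-C (\003) between a map key and its value.
A tiny sketch of one text row for the schema (name string, attrs
map<string,string>) under those defaults (the values are made up):

  // name = "viraj", attrs = {k1=v1, k2=v2}
  String row = "viraj" + '\001'
             + "k1" + '\003' + "v1" + '\002'
             + "k2" + '\003' + "v2";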
> 3) Should I be using LazyArray, LazyMap objects in my M/R program to get
> the required serializations?
If you want to use Hive's built-in serdes, you don't need to. You can hand
plain Java objects to the serde together with matching ObjectInspectors, as
sketched below.
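Here is a hedged sketch of that path, going straight from Java objects to the
BytesRefArrayWritable that RCFileOutputFormat takes as its value. This is not
from the attached code; the column names, property keys, and the final cast
are my assumptions about the serde2 API:

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Properties;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
  import org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
  import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
  import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

  public class ToRCValue {
    public static void main(String[] args) throws Exception {
      // Give the serde the table schema, as Hive's metadata normally would.
      Properties tbl = new Properties();
      tbl.setProperty("columns", "id,attrs");
      tbl.setProperty("columns.types", "bigint:map<string,string>");
      ColumnarSerDe serde = new ColumnarSerDe();
      serde.initialize(new Configuration(), tbl);

      // Describe the row as plain Java objects: a struct of (Long, HashMap).
      List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
      fieldOIs.add(PrimitiveObjectInspectorFactory.javaLongObjectInspector);
      fieldOIs.add(ObjectInspectorFactory.getStandardMapObjectInspector(
          PrimitiveObjectInspectorFactory.javaStringObjectInspector,
          PrimitiveObjectInspectorFactory.javaStringObjectInspector));
      StructObjectInspector rowOI = ObjectInspectorFactory
          .getStandardStructObjectInspector(Arrays.asList("id", "attrs"), fieldOIs);

      Map<String, String> attrs = new HashMap<String, String>();
      attrs.put("k1", "v1");
      List<Object> row = new ArrayList<Object>();
      row.add(Long.valueOf(7L));
      row.add(attrs);

      // The serde produces exactly the value type RCFileOutputFormat expects.
      BytesRefArrayWritable value =
          (BytesRefArrayWritable) serde.serialize(row, rowOI);
      System.out.println(value.size());  // one entry per column
    }
  }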
> 4) If I write out my original data using TextFormat instead of
> RCFileOutputFormat and make Hive read it as an external table and then
> store the corresponding results to RCFormat using Hive DDL commands, how
> does Hive convert to RC here? a) Can it do that? b) If it does, what
> separators are used in this case?
a) Yes, it can do that.
b) The separators used come from the table's metadata. If none are defined,
it will use the defaults defined in LazySimpleSerDe.
As long as the data can be parsed by Hive, Hive can convert it into whatever
format you want. So you need Hive to be able to parse your text format
(again, be careful with separators).
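For example (a sketch; the table names, columns, and path are made up, and on
newer Hive the RC table can just say STORED AS RCFILE):

  CREATE EXTERNAL TABLE text_t (id BIGINT, attrs MAP<STRING,STRING>)
  ROW FORMAT DELIMITED
  LOCATION '/user/viraj/text_data';

  CREATE TABLE rc_t (id BIGINT, attrs MAP<STRING,STRING>)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';

  INSERT OVERWRITE TABLE rc_t SELECT * FROM text_t;

Hive reads text_t with LazySimpleSerDe (default separators, since none are
declared) and writes rc_t through ColumnarSerDe.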
Basically, Hive uses a deserializer to deserialize the input data into
Hive's built-in types, and a serializer to serialize the data back out to
HDFS.
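A small sketch of that round trip with LazySimpleSerDe (the schema and the
property keys are my assumptions):

  import java.util.Properties;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;

  Properties tbl = new Properties();
  tbl.setProperty("columns", "id,attrs");
  tbl.setProperty("columns.types", "bigint:map<string,string>");
  LazySimpleSerDe serde = new LazySimpleSerDe();
  serde.initialize(new Configuration(), tbl);

  // \001 between fields, \002 between map entries, \003 between key and value
  Text line = new Text("42\001k1\003v1\002k2\003v2");
  Object row = serde.deserialize(line);        // into Hive's in-memory types
  ObjectInspector rowOI = serde.getObjectInspector();
  Writable out = serde.serialize(row, rowOI);  // back out to bytes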

Attached is some code letting Hive parse Zebra tables, which use Pig's Tuple
as their data type. Right now it works well with primitive Pig types, but it
should not be very difficult to extend it to work with complex types.
I hope this code is helpful to you. The code most related to serde is
under zebra/serde and

> Any insights would be appreciated.
> Thanks Viraj
> -----Original Message-----
> From: Yongqiang He []
> Sent: Tuesday, June 08, 2010 2:25 PM
> To:
> Subject: Re: Converting types from java HashMap, Long and Array to
> BytesWritable for RCFileOutputFormat
> Hi Viraj
> I recommend you use Hive's ColumnarSerDe/LazySimpleSerDe code to serialize
> and deserialize the data. This can help you avoid writing your own way to
> serialize/deserialize the data.
> Basically, for primitives, it is easy to serialize and de-serialize. But
> for
> complex types, you need to use separators.
> Thanks
> Yongqiang
> On 6/8/10 10:50 AM, "Viraj Bhat" <> wrote:
>> Hi all,
>>   I am working on an M/R program to convert Zebra data to Hive RC
>> format. 
>> The TableInputFormat (Zebra) returns keys and values in the form of
>> BytesWritable and (Pig) Tuple.
>> In order to convert it to the RCFileOutputFormat, whose key is
>> "BytesWritable" and value is "BytesRefArrayWritable", I need to take in
>> a Pig Tuple, iterate over each of its contents, and convert it to
>> "BytesRefWritable".
>> The easy part is for Strings, which can be converted to BytesRefWritable
>> as:
>> BytesRefArrayWritable myvalue = new BytesRefArrayWritable(10);
>> // value is a Pig Tuple and get() returns a String
>> String s = (String) value.get(0);
>> myvalue.set(0, new BytesRefWritable(s.getBytes("UTF-8")));
>> How do I do it for Java "Long", "HashMap" and "Arrays"?
>> // value is a Pig Tuple
>> Long l = new Long((Long) value.get(1));
>> myvalue.set(iter, new BytesRefWritable(l.toString().getBytes("UTF-8")));
>> myvalue.set(1, new BytesRefWritable(l.getBytes("UTF-8")));
>> HashMap<String, Object> hm =
>> new HashMap<String, Object>((HashMap) value.get(2));
>> myvalue.set(iter, new BytesRefWritable(hm.toString().getBytes("UTF-8")));
>> Would the toString() method work? If I need to re-read the RC format back
>> through "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe", would
>> it interpret it correctly?
>> Is there any documentation for the same?
>> Any suggestions would be beneficial.
>> Viraj
