hive-user mailing list archives

From Zheng Shao <zsh...@gmail.com>
Subject Re: SerDe with a binary formatted file.
Date Mon, 13 Apr 2009 18:06:12 GMT
Hi Bill,

There are 2 missing pieces of code to make Hive directly read data like
this:

1. FileFormat: We need to write a derived class of InputFormat in Hadoop
(for example, by extending FileInputFormat) to be able to read this file
format. The FileFormat tells us how the rows are stored in the file.
2. ProtocolBufferSerDe: We need to write a class that implements the SerDe
interface from Hive. The SerDe tells us the format of each row.
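
To make the first piece more concrete, here is a minimal sketch in plain
Java of the framing logic such a record reader would need: read a 4-byte
length, then read exactly that many bytes as one row. This is not Hive or
Hadoop code. The names LengthPrefixedReader, readRecord and readAll are
hypothetical, and the sketch assumes the length is a signed big-endian
integer, which the file description below does not specify.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hive or Hadoop.
class LengthPrefixedReader {

    // Reads one [4-byte length][block] record.
    // Returns null when the stream ends cleanly before a new record starts.
    static byte[] readRecord(DataInputStream in) throws IOException {
        int first = in.read();
        if (first < 0) {
            return null; // clean end of stream, no partial record
        }
        // Assumption: the length is big-endian. readFully throws
        // EOFException if the stream is truncated mid-record.
        byte[] rest = new byte[3];
        in.readFully(rest);
        int length = (first << 24)
                | ((rest[0] & 0xFF) << 16)
                | ((rest[1] & 0xFF) << 8)
                | (rest[2] & 0xFF);
        if (length < 0) {
            throw new IOException("negative record length: " + length);
        }
        byte[] block = new byte[length];
        in.readFully(block);
        return block;
    }

    // Reads every record in the stream; each block would become one row.
    static List<byte[]> readAll(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        List<byte[]> records = new ArrayList<>();
        byte[] block;
        while ((block = readRecord(in)) != null) {
            records.add(block);
        }
        return records;
    }
}
```

In a real InputFormat, the same loop would sit inside the record reader and
hand each block to Hive as one value, which the SerDe would then parse as a
protocol buffer. Note that a file framed this way has no sync markers, so a
reader cannot safely start in the middle of it; treating each file as a
single unsplittable input is the simplest way around that.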

Let us know if you have more questions on this.

Zheng

On Mon, Apr 13, 2009 at 9:06 AM, Bill Craig <bcraig7@gmail.com> wrote:

> I am attempting to write a SerDe implementation to load a binary
> formatted file which consists of the following repeating form:
>
> Integer (4 Bytes, length of binary block)
> Binary block of data of variable length designated by the preceding
> Integer value (This happens to be a protocol buffer).
>
> Deserializing the protocol buffer is fairly straightforward if given
> the correct size writable blob from Hive. However, since the file is
> binary I do not see how to give Hive a way to send me the correct size
> blob of data. There is no way to specify a “row delimited by”.
> While this problem is using Protocol Buffers it should be the same as
> parsing the input to any binary file that requires sequential
> reading. I have been looking into extending BytesWritable,
> which would work with a direct Hadoop read but I don’t know
> how to get Hive to read using that interface.
>
> I know I can write a Hadoop program to reformat these files but
> there is quite a lot of data and I would like to avoid doing that.
>
> Am I missing something obvious?
>



-- 
Yours,
Zheng
