hive-user mailing list archives

From David Quigley
Subject Deserializing into multiple records
Date Wed, 02 Apr 2014 03:45:34 GMT
We are currently streaming complex documents to HDFS with the hope of being
able to query them. Each document logically breaks down into a set of
individual records. In order to use Hive, we preprocess each input document
into a set of discrete records, which we save on HDFS and create an
external table on top of.

This approach works, but we end up duplicating a lot of data in the
records. It would be much more efficient to deserialize the document into a
set of records when a query is made. That way, we can just save the raw
documents on HDFS.
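
To make the duplication concrete, here is a rough sketch of the kind of preprocessing step described above. The document shape (shared `doc_id` and `source` fields plus an `items` list) and the `flatten` helper are hypothetical, chosen only for illustration; the real documents are more complex.

```python
import json


def flatten(document):
    """Split one document into flat records, copying shared fields into each."""
    # Everything except the nested list is treated as shared, document-level data.
    shared = {k: v for k, v in document.items() if k != "items"}
    # Each nested item becomes its own record and carries a full copy of the shared fields.
    return [{**shared, **item} for item in document["items"]]


# Hypothetical document: two shared fields plus a list of per-record entries.
doc = {
    "doc_id": "d1",
    "source": "sensor-a",
    "items": [{"seq": 1, "value": 10}, {"seq": 2, "value": 12}],
}

records = flatten(doc)
for record in records:
    # One JSON object per line, the sort of output an external table could sit on.
    print(json.dumps(record, sort_keys=True))
```

Every output record repeats `doc_id` and `source`, so the stored size grows with the number of records per document rather than with the size of the raw document.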

I have looked into writing a custom SerDe.

 *deserialize*( blob)

It looks like the input record => deserialized record still needs to be a
1:1 relationship. Is there any way to deserialize a record into multiple
records?

