hive-user mailing list archives

From David Quigley <dquigle...@gmail.com>
Subject Deserializing into multiple records
Date Wed, 02 Apr 2014 03:45:34 GMT
We are currently streaming complex documents to HDFS with the hope of being
able to query them. Each single document logically breaks down into a set of
individual records. In order to use Hive, we preprocess each input document
into a set of discrete records, which we save on HDFS and create an
external table on top of.

This approach works, but we end up duplicating a lot of data in the
records. It would be much more efficient to deserialize the document into a
set of records when a query is made. That way, we can just save the raw
documents on HDFS.
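To make the duplication concrete, here is a minimal sketch of the kind of preprocessing described above: one document with shared header fields plus a list of items is flattened into discrete records, and the shared fields get copied into every record. The field names (`docId`, `item`) are illustrative, not from our actual schema.

```java
import java.util.*;

public class Flatten {
    // Flatten one document (shared header fields plus a list of items)
    // into discrete records. Note the shared header fields are duplicated
    // into every output record -- this is the overhead we want to avoid.
    static List<Map<String, String>> flatten(Map<String, String> header,
                                             List<String> items) {
        List<Map<String, String>> records = new ArrayList<>();
        for (String item : items) {
            Map<String, String> record = new LinkedHashMap<>(header); // copy of shared fields
            record.put("item", item);
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        Map<String, String> header = Map.of("docId", "42", "source", "sensorA");
        List<Map<String, String>> records = flatten(header, List.of("a", "b", "c"));
        System.out.println(records.size()); // one record per item, header repeated in each
    }
}
```

With N items per document, the header fields are stored N times, which is exactly the duplication we would rather avoid by keeping only the raw documents on HDFS.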

I have looked into writing a custom SerDe:

java.lang.Object deserialize(org.apache.hadoop.io.Writable blob)

It looks like the mapping from input record to deserialized record still
needs to be 1:1. Is there any way to deserialize a single record into
multiple records?

Thanks,
Dave
