hive-user mailing list archives

From David Quigley <>
Subject Re: Deserializing into multiple records
Date Fri, 04 Apr 2014 04:02:26 GMT
Thanks again, Petter. The custom input format was exactly what I needed.

Here is an example of my code, in case anyone is interested.

It basically gives you SQL access to arbitrary JSON data. I know there are
solutions for dealing with JSON data in Hive fields, but nothing I saw
actually decomposes nested JSON into a set of discrete records. It's super
useful for us.
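In outline, the InputFormat/RecordReader half looks something like the
sketch below. It is heavily simplified: the class names are illustrative,
and the real code walks the nested JSON with a parser instead of the
per-line placeholder.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class JsonDocInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    // Each JSON document has to be parsed as a whole, so never split it.
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new JsonDocRecordReader((FileSplit) split, job);
  }

  public static class JsonDocRecordReader
      implements RecordReader<LongWritable, Text> {

    private final String[] records;  // one JSON string per logical record
    private int pos = 0;

    JsonDocRecordReader(FileSplit split, JobConf job) throws IOException {
      // Read the whole document (assumes it fits in memory).
      FileSystem fs = split.getPath().getFileSystem(job);
      byte[] buf = new byte[(int) split.getLength()];
      try (FSDataInputStream in = fs.open(split.getPath())) {
        in.readFully(0, buf);
      }
      // Placeholder decomposition: the real code walks the nested JSON
      // with a parser and emits one flat JSON string per logical record.
      records = new String(buf, "UTF-8").split("\n");
    }

    @Override
    public boolean next(LongWritable key, Text value) {
      // Keep returning records until the document is exhausted; this is
      // where one input file fans out into many rows.
      if (pos >= records.length) return false;
      key.set(pos);
      value.set(records[pos++]);
      return true;
    }

    @Override public LongWritable createKey() { return new LongWritable(); }
    @Override public Text createValue() { return new Text(); }
    @Override public long getPos() { return pos; }
    @Override public float getProgress() {
      return records.length == 0 ? 1f : (float) pos / records.length;
    }
    @Override public void close() {}
  }
}

Marking the files non-splittable matters here; a split landing in the
middle of a document would hand the parser half a JSON tree.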

On Wed, Apr 2, 2014 at 2:15 AM, Petter von Dolwitz (Hem) <> wrote:

> Hi David,
> you can implement a custom InputFormat (extends
> org.apache.hadoop.mapred.FileInputFormat) accompanied by a custom
> RecordReader (implements org.apache.hadoop.mapred.RecordReader). The
> RecordReader will be used to read your documents, and from there you can
> decide which units to return as records (returned by the next() method).
> You'll still probably need a SerDe that transforms your data into Hive
> data types using a 1:1 mapping.
> In this way you only duplicate your data while your query runs (and
> possibly in the results) to avoid JOIN operations, but the raw files
> will not contain duplicate data.
> Something like this:
>   CREATE EXTERNAL TABLE mytable (
>     myfield1 STRING,
>     myfield2 INT)
>   PARTITIONED BY (your_partition_if_applicable STRING)
>   ROW FORMAT SERDE 'quigley.david.myserde'
>   STORED AS INPUTFORMAT 'quigley.david.myinputformat'
>   OUTPUTFORMAT ''
>   LOCATION 'mylocation';
> Hope this helps.
> Br,
> Petter
> 2014-04-02 5:45 GMT+02:00 David Quigley <>:
>> We are currently streaming complex documents to HDFS with the hope of
>> being able to query them. Each single document logically breaks down into
>> a set of individual records. In order to use Hive, we preprocess each
>> input document into a set of discrete records, which we save on HDFS and
>> create an external table on top of.
>> This approach works, but we end up duplicating a lot of data in the
>> records. It would be much more efficient to deserialize the document into
>> a set of records when a query is made. That way, we can just save the raw
>> documents on HDFS.
>> I have looked into writing a custom SerDe:
>>   Object deserialize(Writable blob)
>> It looks like the input record => deserialized record still needs to be a
>> 1:1 relationship. Is there any way to deserialize a record into multiple
>> records?
>> Thanks,
>> Dave
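P.S. For anyone finding this thread later: the SerDe half really is a
plain 1:1 mapping once the RecordReader has split the document, just as
Petter described. A simplified sketch (field names match the DDL above;
assumes a Jackson-style JSON library on the classpath and that both
fields are present in every record):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MyJsonSerDe extends AbstractSerDe {

  private final ObjectMapper mapper = new ObjectMapper();
  private ObjectInspector inspector;

  @Override
  public void initialize(Configuration conf, Properties tbl) {
    // Hard-coded to the DDL columns (myfield1 STRING, myfield2 INT);
    // a real SerDe would read the column names and types from tbl.
    List<String> names = Arrays.asList("myfield1", "myfield2");
    List<ObjectInspector> inspectors = Arrays.<ObjectInspector>asList(
        PrimitiveObjectInspectorFactory.javaStringObjectInspector,
        PrimitiveObjectInspectorFactory.javaIntObjectInspector);
    inspector =
        ObjectInspectorFactory.getStandardStructObjectInspector(names, inspectors);
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    // 1:1 mapping: each value from the RecordReader becomes exactly one row.
    try {
      JsonNode node = mapper.readTree(blob.toString());
      List<Object> row = new ArrayList<Object>(2);
      row.add(node.get("myfield1").asText());
      row.add(node.get("myfield2").asInt());
      return row;
    } catch (IOException e) {
      throw new SerDeException(e);
    }
  }

  @Override
  public ObjectInspector getObjectInspector() {
    return inspector;
  }

  @Override
  public SerDeStats getSerDeStats() {
    return null;  // no statistics collected in this sketch
  }

  // The table is read-only (external), so the write path is left out.
  @Override
  public Class<? extends Writable> getSerializedClass() {
    return Text.class;
  }

  @Override
  public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
    throw new SerDeException("write path not implemented");
  }
}

Hive calls deserialize() once per value handed back by the RecordReader's
next(), so all of the fan-out from one document to many rows happens
upstream in the input format.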
