hive-user mailing list archives

From David Quigley <dquigle...@gmail.com>
Subject Re: Deserializing into multiple records
Date Wed, 02 Apr 2014 13:53:23 GMT
Makes perfect sense, thanks Petter!


On Wed, Apr 2, 2014 at 2:15 AM, Petter von Dolwitz (Hem) <
petter.von.dolwitz@gmail.com> wrote:

> Hi David,
>
> you can implement a custom InputFormat (extends
> org.apache.hadoop.mapred.FileInputFormat) accompanied by a custom
> RecordReader (implements org.apache.hadoop.mapred.RecordReader). The
> RecordReader will be used to read your documents, and from there you can
> decide which units to return as records (returned by the next()
> method). You'll still probably need a SerDe that transforms your data into
> Hive data types using a 1:1 mapping.
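>
> Something along these lines might work (an untested sketch; MyInputFormat,
> MyRecordReader and the splitDocument() helper are placeholder names, and
> it assumes one raw document per line of input):
>
> import java.io.IOException;
> import java.util.ArrayDeque;
> import java.util.Arrays;
> import java.util.List;
> import java.util.Queue;
>
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.FileInputFormat;
> import org.apache.hadoop.mapred.FileSplit;
> import org.apache.hadoop.mapred.InputSplit;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.LineRecordReader;
> import org.apache.hadoop.mapred.RecordReader;
> import org.apache.hadoop.mapred.Reporter;
>
> public class MyInputFormat extends FileInputFormat<LongWritable, Text> {
>
>   @Override
>   public RecordReader<LongWritable, Text> getRecordReader(
>       InputSplit split, JobConf job, Reporter reporter) throws IOException {
>     return new MyRecordReader(job, (FileSplit) split);
>   }
>
>   /** Reads one document per line, emits each contained unit as a record. */
>   public static class MyRecordReader implements RecordReader<LongWritable, Text> {
>
>     private final LineRecordReader lines;   // delegate: one document per line
>     private final LongWritable lineKey = new LongWritable();
>     private final Text document = new Text();
>     private final Queue<String> pending = new ArrayDeque<String>();
>     private long recordCount = 0;
>
>     public MyRecordReader(JobConf job, FileSplit split) throws IOException {
>       this.lines = new LineRecordReader(job, split);
>     }
>
>     public boolean next(LongWritable key, Text value) throws IOException {
>       // Refill the queue from the next document until we have a record
>       // to hand out or the split is exhausted.
>       while (pending.isEmpty()) {
>         if (!lines.next(lineKey, document)) {
>           return false;
>         }
>         pending.addAll(splitDocument(document.toString()));
>       }
>       key.set(recordCount++);
>       value.set(pending.poll());
>       return true;
>     }
>
>     // Placeholder: break one raw document into its individual records.
>     private List<String> splitDocument(String doc) {
>       return Arrays.asList(doc.split(";"));
>     }
>
>     public LongWritable createKey() { return new LongWritable(); }
>     public Text createValue() { return new Text(); }
>     public long getPos() throws IOException { return lines.getPos(); }
>     public float getProgress() throws IOException { return lines.getProgress(); }
>     public void close() throws IOException { lines.close(); }
>   }
> }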
>
> This way, you can choose to duplicate your data only while the query runs
> (and possibly in the results) to avoid JOIN operations, but the raw files
> will not contain duplicate data.
>
> Something like this:
>
> CREATE EXTERNAL TABLE IF NOT EXISTS MyTable (
>   myfield1 STRING,
>   myfield2 INT)
>   PARTITIONED BY (your_partition_if_applicable STRING)
>   ROW FORMAT SERDE 'quigley.david.myserde'
>   STORED AS INPUTFORMAT 'quigley.david.myinputformat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>   LOCATION 'mylocation';
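>
> And a rough sketch of the matching SerDe (again untested; MySerDe is a
> placeholder name, the column names mirror the DDL above, and the
> comma-splitting in deserialize() is just an example of a 1:1 mapping):
>
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.List;
> import java.util.Properties;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hive.serde2.AbstractSerDe;
> import org.apache.hadoop.hive.serde2.SerDeException;
> import org.apache.hadoop.hive.serde2.SerDeStats;
> import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
> import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
> import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.io.Writable;
>
> public class MySerDe extends AbstractSerDe {
>
>   private ObjectInspector inspector;
>   private final List<Object> row = new ArrayList<Object>(2);
>
>   @Override
>   public void initialize(Configuration conf, Properties tbl) {
>     // One inspector per column declared in the DDL:
>     // myfield1 STRING, myfield2 INT.
>     List<String> names = Arrays.asList("myfield1", "myfield2");
>     List<ObjectInspector> ois = Arrays.<ObjectInspector>asList(
>         PrimitiveObjectInspectorFactory.javaStringObjectInspector,
>         PrimitiveObjectInspectorFactory.javaIntObjectInspector);
>     inspector = ObjectInspectorFactory.getStandardStructObjectInspector(names, ois);
>   }
>
>   @Override
>   public Object deserialize(Writable blob) throws SerDeException {
>     // Each call sees exactly one record; the RecordReader already split
>     // the document, so the mapping here stays 1:1.
>     String[] fields = blob.toString().split(",", 2);
>     row.clear();
>     row.add(fields[0]);
>     row.add(Integer.valueOf(fields[1].trim()));
>     return row;
>   }
>
>   @Override
>   public ObjectInspector getObjectInspector() { return inspector; }
>
>   @Override
>   public Class<? extends Writable> getSerializedClass() { return Text.class; }
>
>   @Override
>   public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
>     throw new SerDeException("write path not implemented");
>   }
>
>   @Override
>   public SerDeStats getSerDeStats() { return null; }
> }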
>
>
> Hope this helps.
>
> Br,
> Petter
>
>
>
>
> 2014-04-02 5:45 GMT+02:00 David Quigley <dquigley89@gmail.com>:
>
>> We are currently streaming complex documents to HDFS with the hope of
>> being able to query them. Each single document logically breaks down into a set
>> of individual records. In order to use Hive, we preprocess each input
>> document into a set of discrete records, which we save on HDFS and create
>> an external table on top of.
>>
>> This approach works, but we end up duplicating a lot of data in the
>> records. It would be much more efficient to deserialize the document into a
>> set of records when a query is made. That way, we can just save the raw
>> documents on HDFS.
>>
>> I have looked into writing a custom SerDe.
>>
>> Object deserialize(org.apache.hadoop.io.Writable blob)
>>
>> It looks like the mapping from input record to deserialized record still
>> needs to be 1:1. Is there any way to deserialize a record into multiple
>> records?
>>
>> Thanks,
>> Dave
>>
>
>
