Subject: Re: Deserializing into multiple records
From: David Quigley <dquigley89@gmail.com>
To: user@hive.apache.org
Date: Wed, 2 Apr 2014 06:53:23 -0700 (PDT)

Makes perfect sense, thanks Petter!


On Wed, Apr 2, 2014 at 2:15 AM, Petter von Dolwitz (Hem) <petter.von.dolwitz@gmail.com> wrote:
Hi David,

you can implement a custom InputFormat (extends org.apache.hadoop.mapred.FileInputFormat) accompanied by a custom RecordReader (implements org.apache.hadoop.mapred.RecordReader). The RecordReader will be used to read your documents, and from there you can decide which units to return as records (returned by the next() method). You'll still probably need a SerDe that transforms your data into Hive data types using a 1:1 mapping.
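
As an untested sketch (assuming one document per file that fits in memory; the class names and the splitDocument() helper are placeholders for your own parsing, with the package chosen to match the DDL further down), the pair could look something like this:

package quigley.david; // placeholder package, matching the DDL below

import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MyInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // hand each document to a single reader in one piece
    }

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new MyRecordReader((FileSplit) split, job);
    }

    public static class MyRecordReader implements RecordReader<LongWritable, Text> {

        private final Iterator<String> records;
        private long pos = 0;

        MyRecordReader(FileSplit split, JobConf job) throws IOException {
            // Read the whole document into memory, then break it into records.
            FileSystem fs = split.getPath().getFileSystem(job);
            byte[] buf = new byte[(int) split.getLength()];
            FSDataInputStream in = fs.open(split.getPath());
            try {
                in.readFully(0, buf);
            } finally {
                in.close();
            }
            records = splitDocument(new String(buf, Charset.forName("UTF-8"))).iterator();
        }

        // Placeholder: replace with the logic that turns one document into
        // its individual records (e.g. walking a parsed JSON/XML tree).
        private static List<String> splitDocument(String document) {
            return Arrays.asList(document.split("\n"));
        }

        @Override
        public boolean next(LongWritable key, Text value) {
            if (!records.hasNext()) {
                return false; // this document is exhausted
            }
            key.set(pos++);
            value.set(records.next()); // one logical record per call
            return true;
        }

        @Override public LongWritable createKey() { return new LongWritable(); }
        @Override public Text createValue() { return new Text(); }
        @Override public long getPos() { return pos; }
        @Override public float getProgress() { return records.hasNext() ? 0.0f : 1.0f; }
        @Override public void close() { }
    }
}

Because the reader, not the SerDe, does the fan-out, each value handed to Hive is already a single record.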

In this way you only duplicate your data while your query runs (and possibly in the results), avoiding JOIN operations, but the raw files will not contain duplicate data.

Something like this:

CREATE EXTERNAL TABLE IF NOT EXISTS MyTable (
  myfield1 STRING,
  myfield2 INT)
PARTITIONED BY (your_partition_if_applicable STRING)
ROW FORMAT SERDE 'quigley.david.myserde'
STORED AS INPUTFORMAT 'quigley.david.myinputformat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'mylocation';
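
The 'quigley.david.myserde' referenced above then only has to do the plain 1:1 mapping. An untested skeleton, assuming Hive 0.12+ (for AbstractSerDe) and stand-in tab-separated parsing:

package quigley.david; // placeholder package, matching the DDL above

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MySerDe extends AbstractSerDe {

    private ObjectInspector inspector;
    private final List<Object> row = new ArrayList<Object>(2);

    @Override
    public void initialize(Configuration conf, Properties tbl) {
        // The row shape must match the table: (myfield1 STRING, myfield2 INT).
        inspector = ObjectInspectorFactory.getStandardStructObjectInspector(
            Arrays.asList("myfield1", "myfield2"),
            Arrays.<ObjectInspector>asList(
                PrimitiveObjectInspectorFactory.javaStringObjectInspector,
                PrimitiveObjectInspectorFactory.javaIntObjectInspector));
    }

    @Override
    public Object deserialize(Writable blob) throws SerDeException {
        // One record in, one row out: the RecordReader already split the document.
        String[] parts = blob.toString().split("\t"); // stand-in parsing
        row.clear();
        row.add(parts[0]);
        row.add(Integer.valueOf(parts[1]));
        return row;
    }

    @Override
    public ObjectInspector getObjectInspector() {
        return inspector;
    }

    @Override
    public SerDeStats getSerDeStats() {
        return null; // no statistics collected in this sketch
    }

    @Override
    public Class<? extends Writable> getSerializedClass() {
        return Text.class;
    }

    @Override
    public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
        throw new SerDeException("This table is read-only");
    }
}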


Hope this helps.
Br,
Petter




2014-04-02 5:45 GMT+02:00 David Quigley <dquigley89@gmail.com>:

We are currently streaming complex documents to hdfs with the hope of being able to query them. Each single document logically breaks down into a set of individual records. In order to use Hive, we preprocess each input document into a set of discrete records, which we save on HDFS and create an external table on top of.

This approach works, but we end up duplicating a lot of data in the records. It would be much more efficient to deserialize the document into a set of records when a query is made. That way, we can just save the raw documents on HDFS.

I have looked into writing a custom SerDe.

Object deserialize(org.apache.hadoop.io.Writable blob)

It looks like the mapping from input record to deserialized record still needs to be 1:1. Is there any way to deserialize a record into multiple records?

Thanks,
Dave

