Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
From: shrikanth shankar <sshankar@qubole.com>
Mime-Version: 1.0 (Apple Message framework v1278)
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_00D5BA18-82ED-4B8A-9BED-A37F0BDEC2B0"
Subject: Re: Writing Custom Serdes for Hive
Date: Tue, 16 Oct 2012 09:09:26 -0700
In-Reply-To: 
 <CAKOFcwpWozChk6oXKz6wG5pWn4g-7e3obUUea3_aLF+WET79Jw@mail.gmail.com>
To: user@hive.apache.org
References: 
 <CAKOFcwp+mT4NY48vdgHbzyLXzXVxASh3GXmYLLE=cfshcwC4qA@mail.gmail.com>
 <9D8A350A3269554E91B45801B5E8CDAC68585A@SOM-EXCH02.nuance.com>
 <CAKOFcwpWozChk6oXKz6wG5pWn4g-7e3obUUea3_aLF+WET79Jw@mail.gmail.com>
Message-Id: <B67F5685-89AA-4728-A46F-FD26A74D7059@qubole.com>


--Apple-Mail=_00D5BA18-82ED-4B8A-9BED-A37F0BDEC2B0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=windows-1252

I think what you need is a custom InputFormat/RecordReader. By the time =
the SerDe is called, the row has already been fetched. I believe the =
record reader can get access to the query's predicates. The code that =
accesses HBase from Hive needs them for the same reason you would with =
Mongo, and might be a good place to start.=20
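As a rough illustration of the fail-open idea quoted below, here is a
minimal sketch in Python, standing in for logic that would really live
in the Java input format or record reader. Every name in it
(mongo_filter_for, SAFE_PATTERN, mapped_columns) is hypothetical and
not part of Hive or of any MongoDB driver. It assumes the predicate has
already been pulled out of the query as a column name, an operator, and
the raw text of the string literal. Anything it does not recognise
yields an empty filter, so the connector falls back to fetching every
document.

```python
import re

# Hypothetical whitelist: only patterns built from plain characters and
# \s, \d, \w are pushed down, since Java and PCRE regexes agree on them.
SAFE_PATTERN = re.compile(r"(?:[A-Za-z0-9 _]|\\[sdw])+")


def unescape_hive_literal(literal):
    """Collapse the doubled backslashes of a Hive string literal."""
    return literal.replace("\\\\", "\\")


def mongo_filter_for(column, operator, literal, mapped_columns):
    """Return a MongoDB filter document, or {} to fetch everything."""
    # Fail open: unknown operators and unmapped columns are not pushed.
    if operator.lower() != "rlike" or column not in mapped_columns:
        return {}
    pattern = unescape_hive_literal(literal)
    # Fail open again if the pattern uses anything outside the whitelist.
    if not SAFE_PATTERN.fullmatch(pattern):
        return {}
    return {mapped_columns[column]: {"$regex": pattern}}
```

Because Hive still applies the rlike to whatever comes back, the
pushed-down filter only has to avoid dropping rows that Hive would
keep, which is why the whitelist is deliberately narrow.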

thanks,
Shrikanth
On Oct 16, 2012, at 8:54 AM, John Omernik wrote:

> The reason I am asking (and maybe YC reads this list and can chime in) =
is that he has written a connector for MongoDB.  It's simple: basically =
it connects to a MongoDB, maps columns (primitives only) to MongoDB =
fields, and allows you to select out of Mongo. Pretty sweet actually, =
and with Mongo, things are really fast for small tables. =20
>=20
>=20
> That being said, I noticed that his connector basically gets all rows =
from a MongoDB collection every time it's run.  And we wanted to see if =
we could extend it to do some simple MongoDB-level filtering based on =
the passed query.  Basically have a fail-open approach... if it saw =
something it thought it could optimize in the MongoDB query to limit =
data, it would; otherwise, it would default to the original approach of =
getting all the data. =20
>=20
>=20
> For example:
>=20
> select * from mongo_table where name rlike 'Bobby\\sWhite'
>=20
> Current method: the connector calls db.collection.find(), which gets =
all the documents from MongoDB, and then Hive does the regex. =20
>=20
> Thing we want to try: "Oh, one of our defined mongo columns has an =
rlike, OK, send this instead: db.collection.find({"name": =
/Bobby\sWhite/});" so less data would need to be transferred. Yes, Hive =
would still run the rlike on the data... "shrug" at least it's running =
it on far less data.  Basically if we could determine shortcuts, we =
could use them.=20
>=20
>=20
> Just trying to understand Serdes and how we are completely not using =
them as intended :)=20
>=20
>=20
>=20
>=20
> On Tue, Oct 16, 2012 at 10:42 AM, Connell, Chuck =
<Chuck.Connell@nuance.com> wrote:
> A serde is actually used the other way around=85 Hive parses the =
query, writes MapReduce code to solve the query, and the generated code =
uses the serde for field access.
>=20
> =20
>=20
> Standard way to write a serde is to start from the trunk regex serde, =
then modify as needed=85
>=20
> =20
>=20
> =
http://svn.apache.org/viewvc/hive/trunk/contrib/src/java/org/apache/hadoop=
/hive/contrib/serde2/RegexSerDe.java?revision=3D1131106&view=3Dmarkup
>=20
>=20
> Also, nice article by Roberto Congiu=85
>=20
> =20
>=20
> http://www.congiu.com/a-json-readwrite-serde-for-hive/
>=20
> =20
>=20
> Chuck Connell
>=20
> Nuance R&D Data Team
>=20
> Burlington, MA
>=20
> =20
>=20
> =20
>=20
> From: John Omernik [mailto:john@omernik.com]=20
> Sent: Tuesday, October 16, 2012 11:30 AM
> To: user@hive.apache.org
> Subject: Writing Custom Serdes for Hive
>=20
> =20
>=20
> We have a maybe obvious question about a serde. When a serde is =
invoked, does it have access to the original Hive query?  Ideally the =
original query could provide the Serde some hints on how to access the =
data on the backend. =20
>=20
> =20
>=20
> Also, are there any good links/documentation on how to write Serdes?  =
Kinda hard to google for, for some reason.=20
>=20
> =20
>=20
> =20
>=20
>=20


--Apple-Mail=_00D5BA18-82ED-4B8A-9BED-A37F0BDEC2B0--