From: "Christopher, Pat"
To: user@hive.apache.org
Date: Fri, 28 Jan 2011 22:34:57 +0000
Subject: RE: Custom SerDe Question

Not sure what I did wrong the first time, but I tried to create a table stored as textfile using my custom SerDe, so it had a format line of:

  ROW FORMAT SERDE 'org.myorg.hadoop.hive.udf.MySerDe' STORED AS textfile

Then I loaded a gzipped file using LOAD DATA LOCAL INPATH 'path.gz' INTO TABLE mytable and it worked as expected, i.e. the file was read and I'm able to query it using Hive.

Sorry to bother, and thanks a bunch for the help! Forcing me to go read more about InputFormats is a long-term help anyway.

Pat
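
For reference, a minimal sketch of the statements described above. The column list is a placeholder (it would have to match whatever the custom SerDe actually exposes); the table name, SerDe class, and path are the ones mentioned in the message:

  CREATE TABLE mytable (line STRING)
  ROW FORMAT SERDE 'org.myorg.hadoop.hive.udf.MySerDe'
  STORED AS textfile;

  -- The .gz file can be loaded as-is; the default text input format picks the
  -- gzip codec from the file extension and decompresses before the SerDe
  -- sees each row.
  LOAD DATA LOCAL INPATH 'path.gz' INTO TABLE mytable;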

 

From: phil young [mailto:phil.wills.young@gmail.com]
Sent: Friday, January 28, 2011 1:54 PM
To: user@hive.apache.org
Subject: Re: Custom SerDe Question

 

To be clear, you would then create the table with the clause:

  STORED AS
    INPUTFORMAT 'your.custom.input.format'

If you make an external table, you'll then be able to point to a directory (or file) that contains gzipped files, or uncompressed files.
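
For illustration, a sketch of what that external table definition could look like. The input format class, SerDe, column, and HDFS location are placeholders; note that Hive generally expects an OUTPUTFORMAT clause alongside an explicit INPUTFORMAT, so the standard text output format is shown here:

  CREATE EXTERNAL TABLE mytable_ext (line STRING)
  ROW FORMAT SERDE 'org.myorg.hadoop.hive.udf.MySerDe'
  STORED AS
    INPUTFORMAT 'your.custom.input.format'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '/user/hive/warehouse/mytable_ext';

  -- The location directory can hold a mix of gzipped and uncompressed files.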

 


 

On Fri, Jan 28, 2011 at 4:52 PM, phil young <phil.wills.young@gmail.com> wrote:

This can be accomplished with a custom input format.

Here's a snippet of the relevant code in the custom RecordReader:

    compressionCodecs = new CompressionCodecFactory(jobConf);
    Path file = split.getPath();
    final CompressionCodec codec = compressionCodecs.getCodec(file);

    // open the file and seek to the start of the split
    start = split.getStart();
    end = start + split.getLength();
    pos = 0;

    FileSystem fs = file.getFileSystem(jobConf);
    fsdat = fs.open(split.getPath());
    fsdat.seek(start);

    // wrap the raw stream in a decompressing stream when the file's extension
    // maps to a known codec (e.g. .gz); otherwise read the bytes as-is
    if (codec != null) {
        fsin = codec.createInputStream(fsdat);
    } else {
        fsin = fsdat;
    }

On Fri, Jan 28, 2011 at 1:57 PM, Christopher, Pat <patrick.christopher@hp.com> wrote:

Hi,

I've written a SerDe and I'd like it to be able to handle compressed data (gzip). Hadoop detects and decompresses on the fly, so if you have a compressed data set and you don't need to perform any custom interpretation of it as you go, Hadoop and Hive will handle it. Is there a way to get Hive to notice the data is compressed, decompress it, then push it through the custom SerDe? Or will I have to either

  a. add some decompression logic to my SerDe (possibly impossible)
  b. decompress the data before pushing it into a table with my SerDe

Thanks!

Pat