From: "Christopher, Pat"
To: user@hive.apache.org
Date: Fri, 28 Jan 2011 22:34:57 +0000
Subject: RE: Custom SerDe Question

Not sure what I did wrong the first time, but I tried to create a table stored as textfile using my custom SerDe, so it had a format line of:

  ROW FORMAT SERDE 'org.myorg.hadoop.hive.udf.MySerDe' STORED AS textfile

Then I loaded a gzipped file using LOAD DATA LOCAL INPATH 'path.gz' INTO TABLE mytable and it worked as expected, i.e. the file was read and I'm able to query it using Hive.

Sorry to bother, and thanks a bunch for the help! Forcing me to go read more about InputFormats is a long-term help anyway.

Pat
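
For reference, a minimal sketch of the statements described above. The column list is a placeholder (it would have to match whatever the custom SerDe actually exposes); the table name, SerDe class, and path are the ones mentioned in the message:

  CREATE TABLE mytable (line STRING)
  ROW FORMAT SERDE 'org.myorg.hadoop.hive.udf.MySerDe'
  STORED AS textfile;

  -- The .gz file can be loaded as-is; the default text input format picks the
  -- gzip codec from the file extension and decompresses before the SerDe
  -- sees each row.
  LOAD DATA LOCAL INPATH 'path.gz' INTO TABLE mytable;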

 

From: phil young [mailto:phil.wills.young@gmail.com]
Sent: Friday, January 28, 2011 1:54 PM
To: user@hive.apache.org
Subject: Re: Custom SerDe Question

 

To be clear, you would then create the table with the clause:

  STORED AS
    INPUTFORMAT 'your.custom.input.format'

If you make an external table, you'll then be able to point to a directory (or file) that contains gzipped files, or uncompressed files.
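
For illustration, a sketch of what that external table definition could look like. The input format class, SerDe, column, and HDFS location are placeholders; note that Hive generally expects an OUTPUTFORMAT clause alongside an explicit INPUTFORMAT, so the standard text output format is shown here:

  CREATE EXTERNAL TABLE mytable_ext (line STRING)
  ROW FORMAT SERDE 'org.myorg.hadoop.hive.udf.MySerDe'
  STORED AS
    INPUTFORMAT 'your.custom.input.format'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '/user/hive/warehouse/mytable_ext';

  -- The location directory can hold a mix of gzipped and uncompressed files.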

 


 

On Fri, Jan 28, 2011 at 4:52 PM, phil young <phil.wills.young@gmail.com> wrote:

This can be accomplished with a custom input format.

Here's a snippet of the relevant code in the custom RecordReader:

    compressionCodecs = new CompressionCodecFactory(jobConf);
    Path file = split.getPath();
    final CompressionCodec codec = compressionCodecs.getCodec(file);

    // open the file and seek to the start of the split
    start = split.getStart();
    end = start + split.getLength();
    pos = 0;

    FileSystem fs = file.getFileSystem(jobConf);
    fsdat = fs.open(split.getPath());
    fsdat.seek(start);

    // wrap the raw stream in a decompressing stream when the file's extension
    // maps to a known codec (e.g. .gz); otherwise read the bytes as-is
    if (codec != null) {
        fsin = codec.createInputStream(fsdat);
    } else {
        fsin = fsdat;
    }

On Fri, Jan 28, 2011 at 1:57 PM, Christopher, Pat <patrick.christopher@hp.com> wrote:

Hi,

I've written a SerDe and I'd like it to be able to handle compressed data (gzip). Hadoop detects and decompresses on the fly, so if you have a compressed data set and you don't need to perform any custom interpretation of it as you go, Hadoop and Hive will handle it. Is there a way to get Hive to notice the data is compressed, decompress it, then push it through the custom SerDe? Or will I have to either

  a. add some decompression logic to my SerDe (possibly impossible)
  b. decompress the data before pushing it into a table with my SerDe

Thanks!

Pat