Subject: Re: Deserializing into multiple records
From: David Quigley <dquigley89@gmail.com>
To: user@hive.apache.org
Date: Wed, 2 Apr 2014 06:53:23 -0700 (PDT)

Makes perfect sense, thanks Petter!


On Wed, Apr 2, 2014 at 2:15 AM, Petter von Dolwitz (Hem) <petter.von.dolwitz@gmail.com> wrote:
Hi David,

you can implement a custom InputFormat (extends org.apache.hadoop.mapred.FileInputFormat) accompanied by a custom RecordReader (implements org.apache.hadoop.mapred.RecordReader). The RecordReader will be used to read your documents, and from there you can decide which units to return as records (returned by the next() method). You'll still probably need a SerDe that transforms your data into Hive data types using a 1:1 mapping.
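
As an untested sketch (assuming one document per file that fits in memory; the class names and the splitDocument() helper are placeholders for your own parsing, with the package chosen to match the DDL further down), the pair could look something like this:

package quigley.david; // placeholder package, matching the DDL below

import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MyInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // hand each document to a single reader in one piece
    }

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new MyRecordReader((FileSplit) split, job);
    }

    public static class MyRecordReader implements RecordReader<LongWritable, Text> {

        private final Iterator<String> records;
        private long pos = 0;

        MyRecordReader(FileSplit split, JobConf job) throws IOException {
            // Read the whole document into memory, then break it into records.
            FileSystem fs = split.getPath().getFileSystem(job);
            byte[] buf = new byte[(int) split.getLength()];
            FSDataInputStream in = fs.open(split.getPath());
            try {
                in.readFully(0, buf);
            } finally {
                in.close();
            }
            records = splitDocument(new String(buf, Charset.forName("UTF-8"))).iterator();
        }

        // Placeholder: replace with the logic that turns one document into
        // its individual records (e.g. walking a parsed JSON/XML tree).
        private static List<String> splitDocument(String document) {
            return Arrays.asList(document.split("\n"));
        }

        @Override
        public boolean next(LongWritable key, Text value) {
            if (!records.hasNext()) {
                return false; // this document is exhausted
            }
            key.set(pos++);
            value.set(records.next()); // one logical record per call
            return true;
        }

        @Override public LongWritable createKey() { return new LongWritable(); }
        @Override public Text createValue() { return new Text(); }
        @Override public long getPos() { return pos; }
        @Override public float getProgress() { return records.hasNext() ? 0.0f : 1.0f; }
        @Override public void close() { }
    }
}

Because the reader, not the SerDe, does the fan-out, each value handed to Hive is already a single record.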

In this way you only duplicate your data while your query runs (and possibly in the results), avoiding JOIN operations, but the raw files will not contain duplicate data.

Something like this:

CREATE EXTERNAL TABLE IF NOT EXISTS MyTable (
  myfield1 STRING,
  myfield2 INT)
PARTITIONED BY (your_partition_if_applicable STRING)
ROW FORMAT SERDE 'quigley.david.myserde'
STORED AS INPUTFORMAT 'quigley.david.myinputformat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'mylocation';
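
The 'quigley.david.myserde' referenced above then only has to do the plain 1:1 mapping. An untested skeleton, assuming Hive 0.12+ (for AbstractSerDe) and stand-in tab-separated parsing:

package quigley.david; // placeholder package, matching the DDL above

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MySerDe extends AbstractSerDe {

    private ObjectInspector inspector;
    private final List<Object> row = new ArrayList<Object>(2);

    @Override
    public void initialize(Configuration conf, Properties tbl) {
        // The row shape must match the table: (myfield1 STRING, myfield2 INT).
        inspector = ObjectInspectorFactory.getStandardStructObjectInspector(
            Arrays.asList("myfield1", "myfield2"),
            Arrays.<ObjectInspector>asList(
                PrimitiveObjectInspectorFactory.javaStringObjectInspector,
                PrimitiveObjectInspectorFactory.javaIntObjectInspector));
    }

    @Override
    public Object deserialize(Writable blob) throws SerDeException {
        // One record in, one row out: the RecordReader already split the document.
        String[] parts = blob.toString().split("\t"); // stand-in parsing
        row.clear();
        row.add(parts[0]);
        row.add(Integer.valueOf(parts[1]));
        return row;
    }

    @Override
    public ObjectInspector getObjectInspector() {
        return inspector;
    }

    @Override
    public SerDeStats getSerDeStats() {
        return null; // no statistics collected in this sketch
    }

    @Override
    public Class<? extends Writable> getSerializedClass() {
        return Text.class;
    }

    @Override
    public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
        throw new SerDeException("This table is read-only");
    }
}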


Hope this helps.
Br,
Petter




2014-04-02 5:45 GMT+02:00 David Quigley <dquigley89@gmail.com>:

We are currently streaming complex documents to hdfs with the hope of being able to query them. Each single document logically breaks down into a set of individual records. In order to use Hive, we preprocess each input document into a set of discrete records, which we save on HDFS and create an external table on top of.

This approach works, but we end up duplicating a lot of data in the records. It would be much more efficient to deserialize the document into a set of records when a query is made. That way, we can just save the raw documents on HDFS.

I have looked into writing a custom SerDe.

Object deserialize(org.apache.hadoop.io.Writable blob)

It looks like the mapping from input record to deserialized record still needs to be 1:1. Is there any way to deserialize a record into multiple records?

Thanks,
Dave

