From: Sanjay Subramanian
To: "user@hive.apache.org"
Subject: Re: JSON format files versus AVRO
Date: Tue, 8 Oct 2013 18:43:07 +0000

Hi 
Thanks. I still have to check out the JsonSerDe in HCatalog.
You are right, and I did think about adding the unique key as an attribute inside the JSON.
Instead of analyzing further, I am going to try both methods out and see how my downstream processes will work. I have a 40-step Oozie workflow that needs to be successful after all this :-)
Cool thanks

Thanks
Regards

sanjay

email : sanjay.subramanian@wizecommerce.com

From: Sushanth Sowmyan <khorgath@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Tuesday, October 8, 2013 11:39 AM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Re: JSON format files versus AVRO

Have you had a look at the JsonSerDe in HCatalog to see if it suits your need?

It does not support the format you are suggesting directly, but if you made the unique ID part of the JSON object, so that each line was a JSON record, it would. It's made to be used in conjunction with text tables.

Also, even if it proves to not be what you want directly, it already provides a serializer/deserializer.
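
A minimal sketch of that suggestion, in Python: it rewrites a `<unique_id> <JSON-string>` line as a single JSON object per line, with the ID folded in as an attribute. The tab delimiter, the `unique_id` key name, and the `embed_id` helper are all assumptions made for illustration; none of them come from the thread.

```python
import json


def embed_id(line: str, id_key: str = "unique_id") -> str:
    """Rewrite '<unique_id>TAB<JSON-string>' as one JSON object on one line.

    Assumptions (not from the thread): the two fields are tab-separated,
    and the JSON payload is an object rather than an array or scalar.
    """
    unique_id, json_text = line.rstrip("\n").split("\t", 1)
    record = json.loads(json_text)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object as the payload")
    # Fold the ID into the object so the line is self-describing.
    record[id_key] = unique_id
    # json.dumps escapes newlines inside strings, so the output is
    # always a single physical line, which a line-oriented text table needs.
    return json.dumps(record, sort_keys=True, separators=(",", ":"))


if __name__ == "__main__":
    print(embed_id('abc123\t{"url": "/home", "status": 200}'))
```

Sorting the keys is only there to make the output deterministic; a line-per-record JSON reader should not care about key order.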

On Oct 7, 2013 4:41 PM, "Sanjay Subramanian" <Sanjay.Subramanian@wizecommerce.com> wrote:
Sorry if the subject sounds really stupid !

Basically I am re-architecting our web log record format

Currently we have "Multiple lines = 1 Record" format (I have Hadoop jobs that parse the files and create columnar output for Hive tables)

[begin_unique_id]
Pipe delimited Blah....................
Pipe delimited Blah....................
Pipe delimited Blah....................
Pipe delimited Blah....................
Pipe delimited Blah....................
[end_unique_id]
 

I have created JSON serializers that will log records in the following way going forward
<unique_id>     <JSON-string>

This is the plan 
- I will store the records in a two column table in Hive
- Write JSON deserializers in Hive UDFs that will take these tables and create Hive tables pertaining to specific requirements
- Modify current aggregation scripts in Hive 
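
The deserialization step in the plan above can be sketched roughly as follows. This is plain Python, not a Hive UDF: it only illustrates taking a (unique ID, JSON string) pair from the two-column table and projecting selected attributes into a flat, tab-delimited row. The `project_row` helper and the field names in the usage example are hypothetical. Writing `\N` for a missing value follows Hive's default NULL representation in text tables.

```python
import json


def project_row(unique_id: str, json_text: str, fields: list) -> str:
    """Flatten one (unique_id, JSON string) pair into a tab-delimited row.

    `fields` lists the top-level JSON attributes to pull out, in column
    order. The field names are whatever the target table needs; the ones
    used in the example below are hypothetical.
    """
    record = json.loads(json_text)
    values = [unique_id]
    for name in fields:
        value = record.get(name)
        # \N is Hive's default text representation of NULL.
        values.append("\\N" if value is None else str(value))
    return "\t".join(values)


if __name__ == "__main__":
    row = project_row(
        "abc123",
        '{"url": "/home", "status": 200}',
        ["url", "status", "referrer"],
    )
    print(row)
```

Each downstream table would get its own field list, so one raw two-column table can feed several narrower tables.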

I was looking at the AVRO format, but I don't see the value of using AVRO when I feel JSON gives me pretty much the same thing?

Please poke holes in my thinking ! Rip me apart ! 


Thanks
Regards

sanjay



CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.
