Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 6721111550 for ; Thu, 14 Aug 2014 22:34:59 +0000 (UTC) Received: (qmail 62873 invoked by uid 500); 14 Aug 2014 22:34:57 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 62805 invoked by uid 500); 14 Aug 2014 22:34:57 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 62795 invoked by uid 99); 14 Aug 2014 22:34:57 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 14 Aug 2014 22:34:57 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of spragues@gmail.com designates 209.85.218.47 as permitted sender) Received: from [209.85.218.47] (HELO mail-oi0-f47.google.com) (209.85.218.47) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 14 Aug 2014 22:34:53 +0000 Received: by mail-oi0-f47.google.com with SMTP id x69so1195750oia.6 for ; Thu, 14 Aug 2014 15:34:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=WpYAE3OnH6SnMBfmSyKs3skWu9eLjWPTIJKfoEZN8+U=; b=khmXPsmHJXbLHFJidfxhbzhiTIbr3EtNvrNrQ6xisnJ45gH3h92HSQU7EbcSdXf+vC +lAYdryF2WtNqZuFPynL6AA87bR7gHRC9/8vfbfA0tsL6XPFUEgwwNaR5KwgJUK+36Oe 7jU/RZHEYDBDBAX0lO2uzJR0P0upfXu1YmP3bVlYGWLsNxYSCKnWs9PzEJUi3l365x+q BxWxg8+sSuz+O/OlyDxOMfZ2Yt8+U/HbeWiZxIH+QbRq5PRXSjiBUX6WajmCuPZm7TmW byrozDEYUio1xuQ1yYrevag5GjTRgkgqZxwcd3e3tejZ67mfH6V9huRn+abk3Cqn9M2D PbLg== X-Received: by 10.182.52.165 with SMTP id u5mr16685167obo.30.1408055672660; Thu, 14 Aug 2014 15:34:32 -0700 (PDT) MIME-Version: 1.0 Received: by 10.202.222.10 with HTTP; Thu, 14 Aug 2014 15:34:12 -0700 (PDT) In-Reply-To: References: From: Stephen Sprague Date: Thu, 14 Aug 2014 15:34:12 -0700 Message-ID: Subject: Re: Altering the Metastore on EC2 To: "user@hive.apache.org" Content-Type: multipart/alternative; boundary=089e0158aaa63169e705009e821a X-Virus-Checked: Checked by ClamAV on apache.org --089e0158aaa63169e705009e821a Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable i'll take a stab at this. - probably no reason. - if you can. is there a derby client s/t you can issue the command: "alter table COLUMNS_V2 modify TYPE_NAME varchar(32672)". otherwise maybe use the mysql or postgres metastores (instead of derby) and run that alter command after the install. - the schema only exists in one place and that's the metastore (which is probably on your namenode for derby.) for mysql or postgres it can be anywhere you want but again examples will probably show localhost (the namenode) that's a mighty big schema! you don't just want to use string type and use get_json_object to pull data out of it dynamically? not as elegant as using static syntax like nested structs but its better than nothing. something to think about anyway. i'm guessing given a nested struct that large you'll get over one hump only to be faced with another one. hive needs to do some crazy mapping there for every record. hopefully that's optimized. :) Good luck! I'd be curious how it goes. On Mon, Aug 11, 2014 at 5:52 PM, David Beveridge wrote: > We are creating an Hive schema for reading massive JSON files. Our JSON > schema is rather large, and we have found that the default metastore sche= ma > for Hive cannot work for us as-is. > > To be specific, one field in our schema has about 17KB of nested structs > within it. Unfortunately, it appears that Hive has a limit of varchar(400= 0) > for the field that stores the resulting definition: > > > > CREATE TABLE "COLUMNS_V2" ( > > "CD_ID" bigint NOT NULL, > > "COMMENT" varchar(4000), > > "COLUMN_NAME" varchar(128) NOT NULL, > > "TYPE_NAME" varchar(4000), > > "INTEGER_IDX" INTEGER NOT NULL, > > PRIMARY KEY ("CD_ID", "COLUMN_NAME") > > ); > > > > We are running this on Amazon MapReduce (v0.11 with default Derby > metastore) > > > > So, our initial questions are: > > =C2=B7 Is there a reason that the TYPE_NAME is being limited to 4= 000 > (IIUC, varchar on derby can grow to 32672, which would be sufficient for > a long time) > > =C2=B7 Can we alter the metastore schema without hacking/reinstal= ling > Hive? (if so, how?) > > =C2=B7 If so, is there a proper way to update the schema on all n= odes? > > > > > > Thanks in advance! > > --DB > --089e0158aaa63169e705009e821a Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
i'll take a stab at this.

- probably no reaso= n.

- if you can. is there a derby client s/t you can issue the command: &q= uot;alter table COLUMNS_V2 modify TYPE_NAME varchar(32672)". otherwise= maybe use the mysql or postgres metastores (instead of derby) and run that= alter command after the install.

- the schema only exists in one place and that's the metastore = (which is probably on your namenode for derby.) for mysql or postgres it ca= n be anywhere you want but again examples will probably show localhost (the= namenode)

that's a mighty big schema! you don't just want to use stri= ng type and use get_json_object to pull data out of it dynamically? not as = elegant as using static syntax like nested structs but its better than noth= ing. something to think about anyway.

i'm guessing given a nested struct that large you'll get ov= er one hump only to be faced with another one. hive needs to do some crazy = mapping there for every record. hopefully that's optimized. :)

Good luck! I'd be curious how it goes.


On Mon, Aug 11, 2014 a= t 5:52 PM, David Beveridge <dbeveridge@cylance.com> wro= te:

We ar= e creating an Hive schema for reading massive JSON files. Our JSON schema i= s rather large, and we have found that the default metastore schema for Hive cannot work for us as-is.

To be specific, one field in our schema has about 17KB of nested structs wi= thin it. Unfortunately, it appears that Hive has a limit of varchar(4000) f= or the field that stores the resulting definition:

=C2=A0

=C2=A0=C2=A0=C2=A0 CREATE TABLE "COLUMNS_V2" (

=C2=A0=C2=A0=C2=A0 "CD_ID" bigint NOT NULL,<= u>

=C2=A0=C2=A0=C2=A0 "COMMENT" varchar(4000),<= u>

=C2=A0=C2=A0=C2=A0 "COLUMN_NAME" varchar(128) NOT N= ULL,

=C2=A0=C2=A0=C2=A0 "TYPE_NAME" varchar(4000),

=C2=A0=C2=A0=C2=A0 "INTEGER_IDX" INTEGER NOT NULL,<= u>

=C2=A0=C2=A0=C2=A0 PRIMARY KEY (= "CD_ID", "COLUMN_NAME")

=C2=A0=C2=A0=C2=A0 );

=C2=A0

We are running this on Amazon MapRed= uce (v0.11 with default Derby metastore)

=C2=A0

So, our initial questions are:

=C2=B7=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 Is there a reason that the TYPE_NAME is being l= imited to 4000 (IIUC, varchar on derby can grow to 32672, which would be sufficient for a long time)

=C2=B7=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 Can we alter the metastore schema without hacki= ng/reinstalling Hive? (if so, how?)

=C2=B7=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 If so, is there a proper way to update the sche= ma on all nodes?

=C2=A0

=C2=A0

Thanks in advance!

--DB


--089e0158aaa63169e705009e821a--