Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
Received-SPF: pass (athena.apache.org: domain of spragues@gmail.com designates
 209.85.218.47 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <f6612d80af8a4f3f8d142b9c62c46c0d@BY2PR06MB485.namprd06.prod.outlook.com>
References: 
 <f6612d80af8a4f3f8d142b9c62c46c0d@BY2PR06MB485.namprd06.prod.outlook.com>
From: Stephen Sprague <spragues@gmail.com>
Date: Thu, 14 Aug 2014 15:34:12 -0700
Message-ID: 
 <CAC06LGZGio03Snbn-3XYMA6UhiH8LkeJC+RiOR72YY7MNPxqng@mail.gmail.com>
Subject: Re: Altering the Metastore on EC2
To: "user@hive.apache.org" <user@hive.apache.org>
Content-Type: multipart/alternative; boundary=089e0158aaa63169e705009e821a

--089e0158aaa63169e705009e821a
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

i'll take a stab at this.

- probably no reason.

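- yes, if you can get a client onto it. is there a derby client such that you
can issue the command: "alter table COLUMNS_V2 alter column TYPE_NAME set
data type varchar(32672)"? (derby wants "alter column ... set data type";
"modify" is the mysql spelling.) otherwise maybe use the mysql or postgres
metastores (instead of derby) and run the equivalent alter command after the
install.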

- the schema only exists in one place and that's the metastore (which is
probably on your namenode for derby), so there is nothing to push out to the
other nodes. for mysql or postgres it can be anywhere you want, but the
examples will probably show localhost (the namenode).
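
for what it's worth, here is a rough sketch of how that alter might look on
each backend. treat it as a starting point rather than a tested recipe: the
identifier quoting and the assumption that the table sits in your default
schema are guesses on my part, so check them against the metastore schema
script that ships with your hive version. also, an embedded derby database
can only be opened by one JVM at a time, so hive would have to be stopped
while a client such as ij is connected.

```sql
-- derby: widen the column in place (derby only lets you increase a varchar length)
ALTER TABLE "COLUMNS_V2" ALTER COLUMN "TYPE_NAME" SET DATA TYPE VARCHAR(32672);

-- mysql: MODIFY restates the column definition
-- (assumption: a single-byte character set; with a multi-byte one a varchar
--  this wide may exceed the row-size limit and a TEXT type may be needed)
ALTER TABLE COLUMNS_V2 MODIFY TYPE_NAME VARCHAR(32672);

-- postgres: quoted names, since unquoted identifiers are folded to lower case
ALTER TABLE "COLUMNS_V2" ALTER COLUMN "TYPE_NAME" TYPE VARCHAR(32672);
```

only the statement for the backend you actually run applies; back up the
metastore before touching it.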

that's a mighty big schema! you don't just want to use string type and use
get_json_object to pull data out of it dynamically? not as elegant as using
static syntax like nested structs but it's better than nothing. something to
think about anyway.
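
a minimal sketch of what i mean, in hiveql. the table name, column name and
json paths (raw_events, json, device.id, device.os.name) are made up for
illustration and are not from your schema:

```sql
-- hypothetical table: one string column holding the raw JSON document per row,
-- so the metastore only ever stores the type name "string" for it
CREATE TABLE raw_events (json STRING);

-- pull nested fields out at query time with JSONPath-style expressions;
-- get_json_object returns NULL when the path is missing or the JSON is invalid
SELECT get_json_object(json, '$.device.id')      AS device_id,
       get_json_object(json, '$.device.os.name') AS os_name
FROM raw_events;
```

the trade-off is that every call parses the json string again, so this is
easy on the metastore but not free at query time. json_tuple is another
option if you need several top-level keys from the same document.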

i'm guessing given a nested struct that large you'll get over one hump only
to be faced with another one. hive needs to do some crazy mapping there for
every record. hopefully that's optimized. :)

Good luck! I'd be curious how it goes.


On Mon, Aug 11, 2014 at 5:52 PM, David Beveridge <dbeveridge@cylance.com>
wrote:

> We are creating a Hive schema for reading massive JSON files. Our JSON
> schema is rather large, and we have found that the default metastore schema
> for Hive cannot work for us as-is.
>
> To be specific, one field in our schema has about 17KB of nested structs
> within it. Unfortunately, it appears that Hive has a limit of varchar(4000)
> for the field that stores the resulting definition:
>
>     CREATE TABLE "COLUMNS_V2" (
>         "CD_ID" bigint NOT NULL,
>         "COMMENT" varchar(4000),
>         "COLUMN_NAME" varchar(128) NOT NULL,
>         "TYPE_NAME" varchar(4000),
>         "INTEGER_IDX" INTEGER NOT NULL,
>         PRIMARY KEY ("CD_ID", "COLUMN_NAME")
>     );
>
> We are running this on Amazon MapReduce (v0.11 with default Derby
> metastore)
>
>
>
> So, our initial questions are:
>
> - Is there a reason that the TYPE_NAME is being limited to 4000 (IIUC,
>   varchar on derby can grow to 32672, which would be sufficient for a long
>   time)
>
> - Can we alter the metastore schema without hacking/reinstalling Hive? (if
>   so, how?)
>
> - If so, is there a proper way to update the schema on all nodes?
>
>
>
>
>
> Thanks in advance!
>
> --DB
>

--089e0158aaa63169e705009e821a--