Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
Received-SPF: pass (athena.apache.org: domain of viral.bajaria@gmail.com
 designates 209.85.213.170 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CALckxSPbo1hGn2sU3koDsz1ecWx5aNxcSko5qUKWQvkqX+kC7A@mail.gmail.com>
References: 
 <CALckxSPbo1hGn2sU3koDsz1ecWx5aNxcSko5qUKWQvkqX+kC7A@mail.gmail.com>
Date: Sun, 30 Nov 2014 22:44:21 -0800
Message-ID: 
 <CALckxSNMMd5dX9RctwdfsDujuHfUgdtKv=mx2ngRDgVXkRh_ig@mail.gmail.com>
Subject: Re: issue with hive wide tables/views
From: Viral Bajaria <viral.bajaria@gmail.com>
To: "user@hive.apache.org" <user@hive.apache.org>
Content-Type: multipart/alternative; boundary=001a113ebbf8c26d90050921f0eb

--001a113ebbf8c26d90050921f0eb
Content-Type: text/plain; charset=UTF-8

Any help would be appreciated here.

This issue becomes a bigger pain when a VIEW references one or more other
VIEWs that have thousands of columns.

It seems that query plan generation hits an unoptimized code path when
there are thousands of columns.

A jstack of a process that had been running for over 30 minutes shows this:
https://gist.github.com/vbajaria/2b46eb015eb5f97954fc

I ran jstack multiple times on the running process, and every time the
SemanticAnalyzer stack trace showed up with the same frames, so I am
guessing that the underlying issue is somewhere in there.
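
For anyone who wants to repeat the sampling, it can be sketched roughly as
below. This is only an illustration, not the exact procedure used here:
hot_frame and sample are hypothetical helper names, the run count and
interval are arbitrary, and it assumes the JDK's jstack is on the PATH and
that the dump separates threads with blank lines.

```python
import subprocess
import time
from collections import Counter


def hot_frame(dump, needle="SemanticAnalyzer"):
    """Return the topmost frame of the first thread whose stack mentions needle.

    Assumes threads in the dump are separated by blank lines and that
    stack frames are the lines starting with "at ". Returns None if no
    thread matches.
    """
    for block in dump.split("\n\n"):
        frames = [
            line.strip()[3:]
            for line in block.splitlines()
            if line.strip().startswith("at ")
        ]
        if any(needle in frame for frame in frames):
            return frames[0]
    return None


def sample(pid, runs=5, interval_s=10.0, needle="SemanticAnalyzer"):
    """Run `jstack <pid>` several times and count the topmost matching frame."""
    counts = Counter()
    for i in range(runs):
        result = subprocess.run(
            ["jstack", str(pid)], capture_output=True, text=True, check=True
        )
        counts[hot_frame(result.stdout, needle)] += 1
        if i < runs - 1:
            time.sleep(interval_s)
    return counts
```

Calling sample(<pid of the Hive CLI JVM>) returns a Counter; if one frame
accounts for nearly every sample, that is a reasonable hint (not proof) of
where the time is going.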

Let me know if more details are needed. Would it help to reach out to the
dev list about this?

Thanks,
Viral


On Wed, Nov 26, 2014 at 11:21 AM, Viral Bajaria <viral.bajaria@gmail.com>
wrote:

> Hi,
>
> I have a table which ended up having 3K+ columns. The building of the
> table wasn't that painful, but the part where things suck is when creating
> VIEWs on top of that table.
>
> One of the views that I want to create needs complex operations and
> references a ton of columns, or almost all of the columns.
>
> When applying this view to hive, it takes over 25 minutes for the view
> definition to get applied. Acceptable if the view didn't need frequent
> updates, but not acceptable if we plan to change the view often or have
> multiple such views.
>
> So the questions:
> 1) Should it take so long for hive to create a view that has so many
> columns ? If not, should we open a JIRA and investigate this issue ?
> 2) The underlying tables are CSV (raw data) or ORC (after some
> processing). Would we benefit from replacing the 3K+ columns with a
> single List<Object> or Map<String, Object> column holding all the
> values, and then extracting only the required columns?
>
> We are on Hive 0.13.0 and our metastore is backed by MariaDB 10
>
> Thanks,
> Viral
>
>


--001a113ebbf8c26d90050921f0eb--