From: Raju Bairishetti
Date: Wed, 18 Jan 2017 10:51:56 +0800
Subject: Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided
To: Michael Allman
Cc: dev@spark.apache.org

Thanks Michael for the response.

On Wed, Jan 18, 2017 at 2:45 AM, Michael Allman wrote:

> Hi Raju,
>
> I'm sorry this isn't working for you. I helped author this functionality
> and will try my best to help.
>
> First, I'm curious why you set spark.sql.hive.convertMetastoreParquet to
> false?

I set it as suggested in SPARK-6910 and the corresponding pull requests. It
did not work for me without setting the
spark.sql.hive.convertMetastoreParquet property.

> Can you link specifically to the jira issue or spark pr you referred to?
> The first thing I would try is setting spark.sql.hive.convertMetastoreParquet
> to true. Setting that to false might also explain why you're getting
> parquet decode errors. If you're writing your table data with Spark's
> parquet file writer and reading with Hive's parquet file reader, there may
> be an incompatibility accounting for the decode errors you're seeing.

https://issues.apache.org/jira/browse/SPARK-6910. My main motivation is to
avoid fetching all the partitions. We reverted
spark.sql.hive.convertMetastoreParquet to true because of the decoding
errors; after reverting it, Spark is fetching all partitions from the table
again.
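For reference, this is roughly how we set the two properties and reproduce
the issue. A minimal sketch, assuming Spark 1.6 with a HiveContext; the app
name and session setup are placeholders, only the two .set(...) lines and
the query shape come from our job:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Only the two .set(...) calls are the settings under discussion;
    // the app name is a placeholder.
    val conf = new SparkConf()
      .setAppName("partition-pruning-check")
      .set("spark.sql.hive.convertMetastoreParquet", "false")
      .set("spark.sql.hive.metastorePartitionPruning", "true")

    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)

    // Filters on every partition column; with pruning working, only the
    // matching partitions should be requested from the metastore.
    val rdf = sqlContext.sql(
      """SELECT rejection_reason
        |FROM dbname.tablename
        |WHERE year = 2016 AND month = 12 AND day = 28
        |  AND hour = 2 AND venture = 'DEFAULT'""".stripMargin)

    rdf.explain()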
> Can you reply with your table's Hive metastore schema, including partition
> schema?

    col1 string
    col2 string
    year int
    month int
    day int
    hour int

    # Partition Information
    # col_name      data_type       comment
    year            int
    month           int
    day             int
    hour            int
    venture         string

> Where are the table's files located?

In HDFS, under a user directory.

> If you do a "show partitions <dbname>.<tablename>" in the spark-sql shell,
> does it show the partitions you expect to see? If not, run "msck repair
> table <dbname>.<tablename>".

Yes, it is listing the partitions.

> Cheers,
>
> Michael
>
> On Jan 17, 2017, at 12:02 AM, Raju Bairishetti wrote:
>
> Had a high-level look into the code. The getHiveQlPartitions method from
> HiveMetastoreCatalog seems to be getting called irrespective of the
> metastorePartitionPruning conf value.
>
> It should not fetch all partitions if we set metastorePartitionPruning to
> true (the default value for this is false):
>
> def getHiveQlPartitions(predicates: Seq[Expression] = Nil): Seq[Partition] = {
>   val rawPartitions = if (sqlContext.conf.metastorePartitionPruning) {
>     table.getPartitions(predicates)
>   } else {
>     allPartitions
>   }
>   ...
>
> def getPartitions(predicates: Seq[Expression]): Seq[HivePartition] =
>   client.getPartitionsByFilter(this, predicates)
>
> lazy val allPartitions = table.getAllPartitions
>
> But somehow getAllPartitions is getting called even after setting
> metastorePartitionPruning to true.
>
> Am I missing something or looking at the wrong place?
>
> On Tue, Jan 17, 2017 at 4:01 PM, Raju Bairishetti wrote:
>
>> Hello,
>>
>> Spark SQL is generating a query plan containing all of the table's
>> partitions even when we apply partition filters in the query. Because of
>> this, the Spark driver/Hive metastore hits OOM, as each table has lots of
>> partitions.
>>
>> We can confirm from the Hive audit logs that it tries to fetch all
>> partitions from the Hive metastore:
>>
>> 2016-12-28 07:18:33,749 INFO [pool-4-thread-184]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(371)) - ugi=rajub ip=/x.x.x.x
>> cmd=get_partitions : db=xxxx tbl=xxxxx
>>
>> We configured the following parameters in the Spark conf to fix the
>> above issue (source: spark JIRA & GitHub pull requests):
>>
>> spark.sql.hive.convertMetastoreParquet false
>> spark.sql.hive.metastorePartitionPruning true
>>
>> plan: rdf.explain
>> == Physical Plan ==
>> HiveTableScan [rejection_reason#626], MetastoreRelation dbname,
>> tablename, None, [(year#314 = 2016),(month#315 = 12),(day#316 = 28),
>> (hour#317 = 2),(venture#318 = DEFAULT)]
>>
>> With these settings, the get_partitions_by_filter method is called and
>> only the required partitions are fetched.
>>
>> But we are seeing parquet decode errors in our applications frequently
>> after this. These decoding errors appear to be caused by changing the
>> serde from Spark's built-in serde to the Hive serde.
>>
>> I feel that fixing query plan generation in spark-sql is the right
>> approach instead of forcing users to use the Hive serde.
>>
>> Is there any workaround/way to fix this issue? I would like to hear more
>> thoughts on this :)
>>
>> On Mon, Jan 16, 2017 at 12:53 PM, Raju Bairishetti wrote:
>>
>>> Waiting for suggestions/help on this...
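To make the two code paths quoted above concrete, the difference is visible
directly at the metastore client level. A minimal sketch against the Hive
1.x metastore client API (the db/table names and the filter string are
illustrative, and constructing the client this way assumes a hive-site.xml
on the classpath):

    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

    val client = new HiveMetaStoreClient(new HiveConf())

    // What metastorePartitionPruning=true should lead to: predicates pushed
    // down as a filter string, showing up as cmd=get_partitions_by_filter in
    // the metastore audit log. Filtering on a string partition key is the
    // safest case; support for integral keys depends on the metastore
    // version/configuration.
    val pruned = client.listPartitionsByFilter(
      "dbname", "tablename", "venture = \"DEFAULT\"", (-1).toShort)

    // What we observe instead: every partition is fetched (cmd=get_partitions
    // in the audit log), which is what drives the driver/metastore OOM on
    // tables with many partitions.
    val all = client.listPartitions("dbname", "tablename", (-1).toShort)

    client.close()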
--
------
Thanks,
Raju Bairishetti,
www.lazada.com