From: Mich Talebzadeh <mich.talebzadeh@cloudtechnologypartners.co.uk>
To: Gopal Vijayaraghavan
Cc: user@hive.apache.org, Gopal Vijayaraghavan
Subject: Re: Hive 2 performance
Date: Thu, 25 Feb 2016 10:38:08 +0000
Organization: Cloud Technology Partners Ltd
Apologies, the job on Spark using functional programming was run on a bigger table. The correct timing is 42 seconds for Spark.

On 25/02/2016 10:15, Mich Talebzadeh wrote:

> Thanks Gopal. I have made the following observations so far.
>
> Using the old MR engine you now get this message, which is fine:
>
> Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
>
> use oraclehadoop;
> --set hive.execution.engine=spark;
> set hive.execution.engine=mr;
> --
> -- Get the total amount sold for each calendar month
> --
>
> select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS StartTime;
>
> CREATE TEMPORARY TABLE tmp AS
> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
> FROM smallsales s, times t, channels c
> WHERE s.time_id = t.time_id
> AND s.channel_id = c.channel_id
> GROUP BY t.calendar_month_desc, c.channel_desc
> ;
>
> select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS FirstQuery;
> SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
> FROM tmp
> ORDER BY MONTH, CHANNEL LIMIT 5
> ;
> select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS SecondQuery;
> SELECT channel_desc AS CHANNEL, MAX(TotalSales) AS SALES
> FROM tmp
> GROUP BY channel_desc
> ORDER BY SALES DESC LIMIT 5
> ;
> select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS EndTime;
>
> This batch returns results on MR in 2 min, 3 sec.
>
> If I change my engine to Hive 2 on Spark 1.3.1,
> I get it back in 1 min, 9 sec.
>
> If I run the same job in the Spark 1.5.2 shell against the same tables, using functional programming and a HiveContext for the tables:
>
> import org.apache.spark.sql.functions.{sum, max, desc}  // needed for the aggregate and sort functions below
> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> println("\nStarted at"); HiveContext.sql("SELECT from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
> HiveContext.sql("use oraclehadoop")
> val s = HiveContext.table("sales").select("AMOUNT_SOLD","TIME_ID","CHANNEL_ID")
> val c = HiveContext.table("channels").select("CHANNEL_ID","CHANNEL_DESC")
> val t = HiveContext.table("times").select("TIME_ID","CALENDAR_MONTH_DESC")
> println("\ncreating data set at"); HiveContext.sql("SELECT from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
> val rs = s.join(t,"time_id").join(c,"channel_id").groupBy("calendar_month_desc","channel_desc").agg(sum("amount_sold").as("TotalSales"))
> println("\nfirst query at"); HiveContext.sql("SELECT from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
> rs.orderBy("calendar_month_desc","channel_desc").take(5).foreach(println)
> println("\nsecond query at"); HiveContext.sql("SELECT from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
> rs.groupBy("channel_desc").agg(max("TotalSales").as("SALES")).orderBy(desc("SALES")).take(5).foreach(println)
> println("\nFinished at"); HiveContext.sql("SELECT from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
>
> I get the job done in under 8 min. OK, this is not a benchmark for Spark, but it shows that Hive 2 has improved significantly IMO. I also had Hive on Spark 1.3.1 crashing on certain large tables (I had to revert to MR), but no issues now.
>
> HTH
>
> On 25/02/2016 09:13, Gopal Vijayaraghavan wrote:
>
>>> Correct, hence the question, as I have done some preliminary tests on Hive 2. I want to share insights with other people who have done the same.
>>
>> If you have feedback on Hive-2.0, I'm all ears.
>>
>> I'm building up 2.1 features & fixes, so now would be a good time to bring stuff up.
>>
>> Speed mostly depends on whether you're using Hive-2.0 with LLAP or not; if you're using the old engines, the plans still get much better (even for MR).
>>
>> Tez does get some stuff out of it, like the new shuffle join vertex manager (hive.optimize.dynamic.partition.hashjoin).
>>
>> LLAP will still win that out for <10s queries, because it takes approx ~10 mins for all the auto-generated vectorized classes to get JIT'd into tight SIMD loops.
>>
>> For something like TPC-H Q1, you can slowly see it turning all the null checks into UncommonTrapBlob as the JIT slowly learns about the data & finds .noNulls is always true.
>>
>> Cheers,
>> Gopal
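A side note on the engine switch above: the only change between the MR and Spark runs is the set line at the top of the batch. A minimal sketch of the variants, assuming Hive on Spark (and, if installed, Tez) is already configured on the cluster:

-- pick exactly one engine before running the batch
set hive.execution.engine=mr;        -- deprecated in Hive 2, but still works
--set hive.execution.engine=spark;   -- Hive on Spark (1.3.1 in the timings above)
--set hive.execution.engine=tez;     -- Tez; not part of the timings above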
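On Gopal's point about the new shuffle join vertex manager: I have not tried it yet, but presumably it is switched on per session with a set, something like the following (speculative; the property name is as Gopal gives it, and I am assuming it is a boolean toggle that only applies on the Tez engine):

set hive.execution.engine=tez;
-- speculative usage of the property Gopal mentioned above
set hive.optimize.dynamic.partition.hashjoin=true;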
--
Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Cloud Technology Partners Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Cloud Technology Partners Ltd, its subsidiaries nor their employees accept any responsibility.