From: Mich Talebzadeh <mich.talebzadeh@cloudtechnologypartners.co.uk>
To: Gopal Vijayaraghavan
Cc: user@hive.apache.org, Gopal Vijayaraghavan
Subject: Re: Hive 2 performance
Date: Thu, 25 Feb 2016 10:38:08 +0000
Organization: Cloud Technology Partners Ltd
Apologies, the job on Spark using functional programming was run on a bigger table. The correct timing is 42 seconds for Spark.

On 25/02/2016 10:15, Mich Talebzadeh wrote:

> Thanks Gopal. I have made the following observations so far.
>
> Using the old MR engine you now get this message, which is fine:
>
> Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
>
> use oraclehadoop;
> --set hive.execution.engine=spark;
> set hive.execution.engine=mr;
> --
> -- Get the total amount sold for each calendar month
> --
>
> select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS StartTime;
>
> CREATE TEMPORARY TABLE tmp AS
> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
> FROM smallsales s, times t, channels c
> WHERE s.time_id = t.time_id
> AND s.channel_id = c.channel_id
> GROUP BY t.calendar_month_desc, c.channel_desc
> ;
>
> select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS FirstQuery;
> SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
> FROM tmp
> ORDER BY MONTH, CHANNEL LIMIT 5
> ;
> select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS SecondQuery;
> SELECT channel_desc AS CHANNEL, MAX(TotalSales) AS SALES
> FROM tmp
> GROUP BY channel_desc
> ORDER BY SALES DESC LIMIT 5
> ;
> select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS EndTime;
>
> This batch returns results on MR in 2 min, 3 sec.
>
> If I change my engine to Hive 2 on Spark 1.3.1,
> I get it back in 1 min, 9 sec.
>
> If I run the same job in the Spark 1.5.2 shell against the same tables, using functional programming and a HiveContext for the tables:
>
> import org.apache.spark.sql.functions.{sum, max, desc}  // needed for the aggregate and sort functions below
> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> println("\nStarted at"); HiveContext.sql("SELECT from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
> HiveContext.sql("use oraclehadoop")
> val s = HiveContext.table("sales").select("AMOUNT_SOLD","TIME_ID","CHANNEL_ID")
> val c = HiveContext.table("channels").select("CHANNEL_ID","CHANNEL_DESC")
> val t = HiveContext.table("times").select("TIME_ID","CALENDAR_MONTH_DESC")
> println("\ncreating data set at"); HiveContext.sql("SELECT from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
> val rs = s.join(t,"time_id").join(c,"channel_id").groupBy("calendar_month_desc","channel_desc").agg(sum("amount_sold").as("TotalSales"))
> println("\nfirst query at"); HiveContext.sql("SELECT from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
> rs.orderBy("calendar_month_desc","channel_desc").take(5).foreach(println)
> println("\nsecond query at"); HiveContext.sql("SELECT from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
> rs.groupBy("channel_desc").agg(max("TotalSales").as("SALES")).orderBy(desc("SALES")).take(5).foreach(println)
> println("\nFinished at"); HiveContext.sql("SELECT from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
>
> I get the job done in under 8 min. OK, this is not a benchmark for Spark, but it shows that Hive 2 has improved significantly IMO. I also had Hive on Spark 1.3.1 crashing on certain large tables (I had to revert to MR), but no issues now.
>
> HTH
>
> On 25/02/2016 09:13, Gopal Vijayaraghavan wrote:
>
>>> Correct, hence the question, as I have done some preliminary tests on Hive 2. I want to share insights with other people who have done the same.
>>
>> If you have feedback on Hive-2.0, I'm all ears.
>>
>> I'm building up 2.1 features & fixes, so now would be a good time to bring stuff up.
>>
>> Speed mostly depends on whether you're using Hive-2.0 with LLAP or not; if you're using the old engines, the plans still get much better (even for MR).
>>
>> Tez does get some stuff out of it, like the new shuffle join vertex manager (hive.optimize.dynamic.partition.hashjoin).
>>
>> LLAP will still win that out for <10s queries, because it takes approx ~10 mins for all the auto-generated vectorized classes to get JIT'd into tight SIMD loops.
>>
>> For something like TPC-H Q1, you can slowly see it turning all the null checks into UncommonTrapBlob as the JIT slowly learns about the data & finds .noNulls is always true.
>>
>> Cheers,
>> Gopal
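A side note on the engine switch above: the only change between the MR and Spark runs is the set line at the top of the batch. A minimal sketch of the variants, assuming Hive on Spark (and, if installed, Tez) is already configured on the cluster:

-- pick exactly one engine before running the batch
set hive.execution.engine=mr;        -- deprecated in Hive 2, but still works
--set hive.execution.engine=spark;   -- Hive on Spark (1.3.1 in the timings above)
--set hive.execution.engine=tez;     -- Tez; not part of the timings above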
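On Gopal's point about the new shuffle join vertex manager: I have not tried it yet, but presumably it is switched on per session with a set, something like the following (speculative; the property name is as Gopal gives it, and I am assuming it is a boolean toggle that only applies on the Tez engine):

set hive.execution.engine=tez;
-- speculative usage of the property Gopal mentioned above
set hive.optimize.dynamic.partition.hashjoin=true;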
--
Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Cloud Technology Partners Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Cloud Technology Partners Ltd, its subsidiaries nor their employees accept any responsibility.