Subject: Re: Using Spark on Hive with Hive also using Spark as its execution engine
From: Michael Segel <msegel_hadoop@hotmail.com>
Date: Mon, 30 May 2016 13:59:48 -0700
To: Mich Talebzadeh <mich.talebzadeh@gmail.com>
Cc: user@hive.apache.org, Jörn Franke, ayan guha, "user @spark"

And you have MapR supporting Apache Drill.

So these are all alternatives to Spark, and it's not necessarily an either/or scenario.
You can have both.

> On May 30, 2016, at 12:49 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>
> Yep, Hortonworks supports Tez for one reason or another, and I hope to test it as the query engine for Hive, though I think Spark will be faster because of its in-memory support.
>
> Also, if you are independent then you are better off dealing with Spark and Hive without the need to support another stack like Tez.
>
> Cloudera supports Impala instead of Hive, but it is not something I have used.
>
> HTH
>
> Dr Mich Talebzadeh
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> On 30 May 2016 at 20:19, Michael Segel <msegel_hadoop@hotmail.com> wrote:
> Mich,
>
> Most people use vendor releases because they need to have the support.
> Hortonworks is the vendor with the most skin in the game when it comes to Tez.
>
> If memory serves, Tez isn't going to be M/R but a local execution engine? Then LLAP is the in-memory piece to speed up Tez?
>
> HTH
>
> -Mike
>
>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>
>> Thanks. I think the problem is that the TEZ user group is exceptionally quiet. I just sent an email to the Hive user group to see if anyone has managed to build a vendor-independent version.
>>
>> On 29 May 2016 at 21:23, Jörn Franke <jornfranke@gmail.com> wrote:
>> Well, I think it is different from MR. It has some optimizations which you do not find in MR. Especially the LLAP option in Hive 2 makes it interesting.
>>
>> I think Hive 1.2 works with Tez 0.7, and Hive 2.0 with Tez 0.8. At least for 1.2 it is integrated in the Hortonworks distribution.
>>
>>> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>
>>> Hi Jörn,
>>>
>>> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from the TEZ user group kindly gave a hand, but I could not get very far (or maybe I did not make enough of an effort) making it work.
>>>
>>> That TEZ user group is very quiet as well.
>>>
>>> My understanding is that TEZ is MR with DAG, but of course Spark has both, plus in-memory capability.
>>>
>>> It would be interesting to see which version of TEZ works as an execution engine with Hive.
>>>
>>> Vendors are divided on this (use Hive with TEZ, or use Impala instead of Hive, etc.), as I am sure you already know.
>>>
>>> Cheers,
>>>
>>> On 29 May 2016 at 20:19, Jörn Franke <jornfranke@gmail.com> wrote:
>>> Very interesting. Do you also plan a test with TEZ?
>>>
>>>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I did another study of Hive using the Spark engine compared to Hive with MR.
>>>>
>>>> Basically I took the original table imported using Sqoop, then created and populated a new ORC table partitioned by year and month into 48 partitions, as follows:
>>>>
>>>> [attached screenshot: sales_partition.PNG]
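The DDL screenshot itself is not preserved in the archive. For readers following along, a minimal HiveQL sketch of that kind of table; the column names and the staging-table name here are hypothetical placeholders, not the actual schema:

    -- Hypothetical target: an ORC table partitioned by year and month
    CREATE TABLE oraclehadoop.sales_orc (
      id        INT,
      amount    DOUBLE,
      sold_date TIMESTAMP
    )
    PARTITIONED BY (year INT, month INT)
    STORED AS ORC;

    -- Populate from the Sqoop-imported staging table, letting Hive
    -- route each row to one of the 48 year/month partitions
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT OVERWRITE TABLE oraclehadoop.sales_orc PARTITION (year, month)
    SELECT id, amount, sold_date,
           year(sold_date)  AS year,
           month(sold_date) AS month
    FROM oraclehadoop.sales_staging;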
>>>> Connections use JDBC via beeline. For each partition using MR it takes an average of 17 minutes, as seen below for each PARTITION, and that is just an individual partition; there are 48 partitions.
>>>>
>>>> In contrast, doing the same operation with the Spark engine took 10 minutes all inclusive. I just gave up on MR. You can see the StartTime and FinishTime below.
>>>>
>>>> [attached screenshot: image.png]
>>>>
>>>> This by no means indicates that Spark is much better than MR, but it shows that some very good results can be achieved using the Spark engine.
>>>>
>>>> On 24 May 2016 at 08:03, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> We use Hive as the database and use Spark as an all-purpose query tool.
>>>>
>>>> Whether Hive is the right database for the purpose, or whether one is better off with something like Phoenix on HBase, well, the answer is it depends, and your mileage varies.
>>>>
>>>> So, fit for purpose.
>>>>
>>>> Ideally, what one wants is to use the fastest method to get the results. How fast we confine to our SLA agreements in production, and that saves us from unnecessary further work, as we technologists like to play around.
>>>>
>>>> So in short, we use Spark most of the time and use Hive as the backend engine for data storage, mainly ORC tables.
>>>>
>>>> We use Hive on Spark, and with Hive 2 on Spark 1.3.1 we have a combination that works for now. Granted, it would help to use Hive 2 on Spark 1.6.1, but at the moment that is one of my projects.
>>>>
>>>> We do not use any vendor's products, as that enables us to avoid being tied down to yet another vendor after years of SAP, Oracle and MS dependency. Besides, there is some politics going on, with one vendor promoting Tez and another Spark as a backend. That is fine, but obviously we prefer an independent assessment ourselves.
>>>>
>>>> My gut feeling is that one needs to look at the use case. Recently we had to import a very large table from Oracle to Hive and decided to use Spark 1.6.1 with Hive 2 on Spark 1.3.1, and that worked fine. We just used a JDBC connection with a temp table and it was good. We could have used Sqoop but decided to settle for Spark, so it all depends on the use case.
>>>>
>>>> HTH
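For reference, the Oracle-to-Hive route via a JDBC DataFrame and a temp table that Mich describes above can be sketched in Spark 1.6-era Scala roughly as below. The host, credentials and table names are made-up placeholders, and the Oracle JDBC driver jar would need to be on the classpath (e.g. via --jars on spark-submit):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object OracleToHive {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("OracleToHive"))
        val hc = new HiveContext(sc)

        // Pull the Oracle table over JDBC into a DataFrame
        val df = hc.read.format("jdbc").options(Map(
          "url"      -> "jdbc:oracle:thin:@//oradb:1521/ORCL",  // placeholder
          "dbtable"  -> "SCOTT.BIGTABLE",                       // placeholder
          "user"     -> "scott",
          "password" -> "tiger",
          "driver"   -> "oracle.jdbc.OracleDriver")).load()

        // Register a temp table so it can be queried alongside Hive tables
        df.registerTempTable("bigtable_tmp")
        hc.sql("SELECT COUNT(*) FROM bigtable_tmp").show()  // sanity check

        // Persist into the Hive warehouse as ORC
        df.write.format("orc").saveAsTable("oraclehadoop.bigtable")
      }
    }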
>>>> On 24 May 2016 at 03:11, ayan guha <guha.ayan@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> Thanks for the very useful stats.
>>>>
>>>> Did you do any benchmarking of using Spark as the backend engine for Hive vs using the Spark thrift server (and running Spark code for Hive queries)? We are using the latter, but it would be very useful to remove the thrift server if we can.
>>>>
>>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke <jornfranke@gmail.com> wrote:
>>>>
>>>> Hi Mich,
>>>>
>>>> I think these comparisons are useful. One interesting aspect could be hardware scalability in this context, and additionally different types of computations. Furthermore, one could compare Spark and Tez+LLAP as execution engines. I have the gut feeling that each one can be justified by different use cases.
>>>> Nevertheless, there should always be a disclaimer with such comparisons, because Spark and Hive are not good for a lot of concurrent lookups of single rows. They are also not good for frequently writing small amounts of data (e.g. sensor data); here HBase could be more interesting. Other use cases can justify graph databases, such as Titan, or text analytics / data matching using Solr on Hadoop.
>>>> Finally, even if you have a lot of data, you need to think about whether you always have to process everything. For instance, I have found valid use cases in practice where we decided to evaluate 10 machine learning models in parallel on only a sample of the data, and then evaluate only the "winning" model on the totality of the data.
>>>>
>>>> As always, it depends :)
>>>>
>>>> Best regards
>>>>
>>>> P.S.: at least Hortonworks has in their distribution Spark 1.5 with Hive 1.2 and Spark 1.6 with Hive 1.2. Maybe they have described somewhere how they manage bringing the two together. You may also check Apache Bigtop (a vendor-neutral distribution) for how they managed to bring the two together.
>>>>
>>>>> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have done a number of extensive tests using spark-shell with the Hive DB and ORC tables.
>>>>>
>>>>> Now, one issue that we typically face is, and I quote:
>>>>>
>>>>> "Spark is fast as it uses memory and DAG. Great, but when we save data it is not fast enough."
>>>>>
>>>>> OK, but there is a solution now. If you use Spark with Hive and you are on a decent version of Hive (>= 0.14), then you can also deploy Spark as the execution engine for Hive. That will make your application run pretty fast, as you no longer rely on the old MapReduce engine for Hive. In a nutshell, you gain speed in both querying and storage.
>>>>>
>>>>> I have made some comparisons on this set-up, and I am sure some of you will find them useful.
>>>>>
>>>>> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
>>>>> The version of Hive I use is Hive 2.
>>>>> The version of Spark I use as the Hive execution engine is 1.3.1. It works, and frankly Spark 1.3.1 as an execution engine is adequate (until we sort out the Hadoop libraries mismatch).
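For anyone wanting to reproduce the engine switch shown below, it comes down to a handful of session properties in beeline. A minimal sketch, assuming Spark is already installed and visible to Hive; the master URL and executor sizing are illustrative placeholders, not the actual settings:

    set hive.execution.engine=spark;
    set spark.master=yarn-client;   -- or a spark://host:7077 standalone master
    set spark.executor.memory=2g;   -- illustrative sizing
    select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;

Switching back is the single "set hive.execution.engine=mr;" shown further down.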
>>>>>
>>>>> As an example, I am using Hive on the Spark engine to find the min and max of IDs for a table with 100 million rows (the original post said 1 billion, but the max(id) and avg(id) below show 100,000,000 consecutive ids):
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;
>>>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>>
>>>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>>>>
>>>>> INFO : Completed compiling command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 1.911 seconds
>>>>> INFO : Executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy
>>>>> INFO : Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>> INFO : Total jobs = 1
>>>>> INFO : Launching Job 1 out of 1
>>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode
>>>>>
>>>>> Query Hive on Spark job[0] stages:
>>>>> 0
>>>>> 1
>>>>> Status: Running (Hive on Spark job[0])
>>>>> Job Progress Format
>>>>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>>>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22         Stage-1_0: 0/1
>>>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>>>>> INFO :
>>>>> Query Hive on Spark job[0] stages:
>>>>> INFO : 0
>>>>> INFO : 1
>>>>> INFO :
>>>>> Status: Running (Hive on Spark job[0])
>>>>> INFO : Job Progress Format
>>>>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>>>>> INFO : 2016-05-23 00:21:19,062 Stage-0_0: 0/22         Stage-1_0: 0/1
>>>>> INFO : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>> INFO : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>> INFO : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>>>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
>>>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1 Finished
>>>>> Status: Finished successfully in 53.25 seconds
>>>>> OK
>>>>> INFO : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
>>>>> INFO : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1 Finished
>>>>> INFO : Status: Finished successfully in 53.25 seconds
>>>>> INFO : Completed executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 56.337 seconds
>>>>> INFO : OK
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> | c0  |     c1     |      c2       |          c3           |
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> 1 row selected (58.529 seconds)
>>>>>
>>>>> 58 seconds for a first run with a cold cache is pretty good.
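As a quick sanity check on those aggregates, assuming the id column holds the consecutive integers 1..N with N = 10^8: the expected values are min = 1, max = N, avg = (N + 1)/2 = 5.00000005 x 10^7, and

    stddev = sqrt((N^2 - 1)/12) ≈ N/sqrt(12) ≈ 2.88675135 x 10^7,

all of which match the c0..c3 columns reported above.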
>>>>>
>>>>> And let us compare it with running the same query on the map-reduce engine:
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
>>>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>> No rows affected (0.007 seconds)
>>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;
>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>>> Total jobs = 1
>>>>> Launching Job 1 out of 1
>>>>> Number of reduce tasks determined at compile time: 1
>>>>> In order to change the average load for a reducer (in bytes):
>>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>> In order to limit the maximum number of reducers:
>>>>>   set hive.exec.reducers.max=<number>
>>>>> In order to set a constant number of reducers:
>>>>>   set mapreduce.job.reduces=<number>
>>>>> Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
>>>>> Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
>>>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0%
>>>>> INFO : Compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy
>>>>> INFO : Semantic Analysis Completed
>>>>> INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, type:double, comment:null)], properties:null)
>>>>> INFO : Completed compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 0.144 seconds
>>>>> INFO : Executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy
>>>>> WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>> INFO : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>> INFO : Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>>> INFO : Total jobs = 1
>>>>> INFO : Launching Job 1 out of 1
>>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode
>>>>> INFO : Number of reduce tasks determined at compile time: 1
>>>>> INFO : In order to change the average load for a reducer (in bytes):
>>>>> INFO :   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>> INFO : In order to limit the maximum number of reducers:
>>>>> INFO :   set hive.exec.reducers.max=<number>
>>>>> INFO : In order to set a constant number of reducers:
>>>>> INFO :   set mapreduce.job.reduces=<number>
>>>>> WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
>>>>> INFO : number of splits:22
>>>>> INFO : Submitting tokens for job: job_1463956731753_0005
>>>>> INFO : The url to track the job: http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>> INFO : Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>> INFO : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
>>>>> INFO : Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
>>>>> INFO : 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0%
>>>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.56 sec
>>>>> INFO : 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.56 sec
>>>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 9.17 sec
>>>>> INFO : 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 9.17 sec
>>>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
>>>>> INFO : 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
>>>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
>>>>> INFO : 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
>>>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
>>>>> INFO : 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
>>>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
>>>>> INFO : 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
>>>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
>>>>> INFO : 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
>>>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
>>>>> INFO : 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
>>>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
>>>>> INFO : 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
>>>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
>>>>> INFO : 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
>>>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
>>>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
>>>>> INFO : 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
>>>>> INFO : 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
>>>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
>>>>> INFO : 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
>>>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
>>>>> INFO : 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
>>>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
>>>>> INFO : 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
>>>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
>>>>> INFO : 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
>>>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
>>>>> INFO : 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
>>>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
>>>>> INFO : 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
>>>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
>>>>> INFO : 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
>>>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
>>>>> INFO : 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
>>>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
>>>>> INFO : 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
>>>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
>>>>> INFO : 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
>>>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
>>>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>>> Ended Job = job_1463956731753_0005
>>>>> MapReduce Jobs Launched:
>>>>> Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>>>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>>> OK
>>>>> INFO : 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
>>>>> INFO : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>>> INFO : Ended Job = job_1463956731753_0005
>>>>> INFO : MapReduce Jobs Launched:
>>>>> INFO : Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>>>>> INFO : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>>> INFO : Completed executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 142.525 seconds
>>>>> INFO : OK
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> | c0  |     c1     |      c2       |          c3           |
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> 1 row selected (142.744 seconds)
>>>>>
>>>>> So Hive on the map-reduce engine took 142 seconds compared to 58 seconds with Hive on Spark; you can obviously gain quite a lot by using Hive on Spark.
>>>>>
>>>>> Please also note that I did not use any vendor's build for this purpose. I compiled Spark 1.3.1 myself.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> http://talebzadehmich.wordpress.com/
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha