hive-user mailing list archives

From Michael Segel <msegel_had...@hotmail.com>
Subject Re: Using Spark on Hive with Hive also using Spark as its execution engine
Date Mon, 30 May 2016 20:59:48 GMT
And you have MapR supporting Apache Drill. 

So these are all alternatives to Spark, and it's not necessarily an either/or scenario. You can have both.

> On May 30, 2016, at 12:49 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> 
> Yep, Hortonworks supports Tez for one reason or another, and I am hoping to test it as the query engine for Hive. Though I think Spark will be faster because of its in-memory support.
> 
> Also, if you are independent, then you are better off dealing with Spark and Hive without the need to support another stack like Tez.
> 
> Cloudera supports Impala instead of Hive, but it is not something I have used.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
> On 30 May 2016 at 20:19, Michael Segel <msegel_hadoop@hotmail.com> wrote:
> Mich, 
> 
> Most people use vendor releases because they need to have the support. 
> Hortonworks is the vendor who has the most skin in the game when it comes to Tez. 
> 
> If memory serves, Tez isn’t going to be M/R but a local execution engine? Then LLAP is the in-memory piece to speed up Tez?
> 
> HTH
> 
> -Mike
> 
>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>> 
>> Thanks. I think the problem is that the TEZ user group is exceptionally quiet. Just sent an email to the Hive user group to see if anyone has managed to build a vendor-independent version.
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>> On 29 May 2016 at 21:23, Jörn Franke <jornfranke@gmail.com> wrote:
>> Well I think it is different from MR. It has some optimizations which you do not find in MR. Especially the LLAP option in Hive 2 makes it interesting.
>> 
>> I think Hive 1.2 works with Tez 0.7, and Hive 2.0 with Tez 0.8. At least for 1.2 it is integrated in the Hortonworks distribution.
>> 
>> 
>> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>> 
>>> Hi Jorn,
>>> 
>>> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from the TEZ user group kindly gave a hand, but I could not get very far (or maybe I did not make enough effort) in making it work.
>>> 
>>> That TEZ user group is very quiet as well.
>>> 
>>> My understanding is that TEZ is MR with DAG, but of course Spark has both, plus in-memory capability.
>>> 
>>> It would be interesting to see which version of TEZ works as the execution engine with Hive.
>>> 
>>> Vendors are divided on this (use Hive with TEZ, or use Impala instead of Hive, etc.), as I am sure you already know.
>>> 
>>> Cheers,
>>> 
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
>>> On 29 May 2016 at 20:19, Jörn Franke <jornfranke@gmail.com> wrote:
>>> Very interesting. Do you also plan a test with TEZ?
>>> 
>>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I did another study of Hive using the Spark engine compared to Hive with MR.
>>>> 
>>>> Basically I took the original table imported using Sqoop and created and populated a new ORC table, partitioned by year and month into 48 partitions, as follows:
>>>> 
>>>> <sales_partition.PNG>
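>>>> In outline the DDL is along these lines (a sketch only: the table and column names here are illustrative placeholders, not the actual schema):
>>>> 
>>>> CREATE TABLE oraclehadoop.sales_orc (
>>>>   id      INT,
>>>>   amount  DOUBLE
>>>> )
>>>> PARTITIONED BY (year INT, month INT)
>>>> STORED AS ORC;
>>>> 
>>>> -- dynamic partition insert into the 48 (year, month) partitions
>>>> SET hive.exec.dynamic.partition.mode=nonstrict;
>>>> INSERT OVERWRITE TABLE oraclehadoop.sales_orc PARTITION (year, month)
>>>> SELECT id, amount, year, month FROM oraclehadoop.sales_staging;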
>>>> Connections use JDBC via beeline. With the MR engine it takes an average of 17 minutes per partition, as seen below. And that is just an individual partition; there are 48 partitions.
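>>>> For reference, a typical connection looks like this (the JDBC URL is the one in the output below; the user name and password are placeholders):
>>>> 
>>>> beeline -u jdbc:hive2://rhes564:10010/default -n hduser -p '********'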
>>>> 
>>>> In contrast, doing the same operation with the Spark engine took 10 minutes all inclusive. I just gave up on MR. You can see the StartTime and FinishTime below:
>>>> 
>>>> <image.png>
>>>> 
>>>> This by no means indicates that Spark is much better than MR, but it shows that some very good results can be achieved using the Spark engine.
>>>> 
>>>> 
>>>> Dr Mich Talebzadeh
>>>>  
>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>  
>>>> http://talebzadehmich.wordpress.com
>>>>  
>>>> 
>>>> On 24 May 2016 at 08:03, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>> Hi,
>>>> 
>>>> We use Hive as the database and use Spark as an all purpose query tool.
>>>> 
>>>> Whether Hive is the right database for the purpose, or one is better off with something like Phoenix on HBase, well, the answer is it depends and your mileage varies.
>>>> 
>>>> So fit for purpose.
>>>> 
>>>> Ideally what one wants is to use the fastest method to get the results. How fast is something we confine to our SLA agreements in production, and that saves us from unnecessary further work, as we technologists like to play around.
>>>> 
>>>> So in short, we use Spark most of the time and use Hive as the backend engine for data storage, mainly ORC tables.
>>>> 
>>>> We use Hive on Spark: with Hive 2 on Spark 1.3.1 we have, for now, a combination that works. Granted, it would help to use Hive 2 on Spark 1.6.1, but at the moment that is one of my projects.
>>>> 
>>>> We do not use any vendor's products, as that enables us to avoid being tied down to yet another vendor after years of SAP, Oracle and MS dependency. Besides, there is some politics going on, with one vendor promoting Tez and another Spark as a backend. That is fine, but obviously we prefer to make an independent assessment ourselves.
>>>> 
>>>> My gut feeling is that one needs to look at the use case. Recently we had to import a very large table from Oracle to Hive and decided to use Spark 1.6.1 (with Hive 2 on Spark 1.3.1), and that worked fine. We just used a JDBC connection with a temp table and it was good; a rough sketch follows. We could have used Sqoop but decided to settle on Spark, so it all depends on the use case.
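>>>> For illustration, a sketch of that route in spark-shell (Spark 1.6 API; the Oracle URL, credentials, table names and bounds are placeholders, not the actual ones we used):
>>>> 
>>>> import org.apache.spark.sql.hive.HiveContext
>>>> val sqlContext = new HiveContext(sc)   // sc is provided by spark-shell
>>>> 
>>>> val df = sqlContext.read.format("jdbc").options(Map(
>>>>   "url"             -> "jdbc:oracle:thin:@//oracle-host:1521/ORCL",  // placeholder
>>>>   "dbtable"         -> "SCOTT.BIGTABLE",                             // placeholder
>>>>   "user"            -> "scott",                                      // placeholder
>>>>   "password"        -> "tiger",                                      // placeholder
>>>>   "partitionColumn" -> "ID",                 // split the read across executors
>>>>   "lowerBound"      -> "1",
>>>>   "upperBound"      -> "100000000",
>>>>   "numPartitions"   -> "22"
>>>> )).load()
>>>> 
>>>> // register a temp table, then persist into Hive as ORC
>>>> df.registerTempTable("tmp_oracle")
>>>> sqlContext.sql("CREATE TABLE oraclehadoop.bigtable STORED AS ORC AS SELECT * FROM tmp_oracle")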
>>>> 
>>>> HTH
>>>> 
>>>> 
>>>> 
>>>> Dr Mich Talebzadeh
>>>>  
>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>  
>>>> http://talebzadehmich.wordpress.com
>>>>  
>>>> 
>>>> On 24 May 2016 at 03:11, ayan guha <guha.ayan@gmail.com> wrote:
>>>> Hi
>>>> 
>>>> Thanks for very useful stats. 
>>>> 
>>>> Did you do any benchmark of using Spark as the backend engine for Hive vs. using the Spark thrift server (and running Spark code for Hive queries)? We are using the latter (roughly the setup sketched below), but it would be very useful to remove the thrift server, if we can.
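>>>> For context, the thrift-server route is roughly the following (the port, master and host are illustrative, not our actual settings):
>>>> 
>>>> $SPARK_HOME/sbin/start-thriftserver.sh --master yarn-client --hiveconf hive.server2.thrift.port=10015
>>>> beeline -u jdbc:hive2://localhost:10015/default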
>>>> 
>>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke <jornfranke@gmail.com> wrote:
>>>> 
>>>> Hi Mich,
>>>> 
>>>> I think these comparisons are useful. One interesting aspect could be hardware scalability in this context, and additionally different types of computations. Furthermore, one could compare Spark and Tez+LLAP as execution engines. I have the gut feeling that each one can be justified by different use cases.
>>>> Nevertheless, there should always be a disclaimer for such comparisons, because Spark and Hive are not good for a lot of concurrent lookups of single rows. They are not good for frequently writing small amounts of data (e.g. sensor data); here HBase could be more interesting. Other use cases can justify graph databases, such as Titan, or text analytics/data matching using Solr on Hadoop.
>>>> Finally, even if you have a lot of data, you need to think about whether you always have to process everything. For instance, I have found valid use cases in practice where we decided to evaluate 10 machine learning models in parallel on only a sample of the data, and then evaluate only the "winning" model on the full data.
>>>> 
>>>> As always it depends :) 
>>>> 
>>>> Best regards
>>>> 
>>>> P.S.: at least Hortonworks has in their distribution Spark 1.5 with Hive 1.2 and Spark 1.6 with Hive 1.2. Maybe they have described somewhere how they manage bringing both together. You may also check Apache Bigtop (a vendor-neutral distribution) for how they managed to bring both together.
>>>> 
>>>> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>> 
>>>>> Hi,
>>>>>  
>>>>> I have done a number of extensive tests using Spark-shell with Hive DB and ORC tables.
>>>>>  
>>>>> Now one issue that we typically face, and I quote, is:
>>>>>  
>>>>> Spark is fast as it uses Memory and DAG. Great but when we save data it is not fast enough
>>>>> 
>>>>> OK, but there is a solution now. If you use Spark with Hive and you are on a decent version of Hive (>= 0.14), then you can also deploy Spark as the execution engine for Hive. That will make your application run pretty fast, as you no longer rely on the old MapReduce engine for Hive. In a nutshell, you gain speed in both querying and storage.
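>>>>> A minimal sketch of the switch, run from beeline (hive.execution.engine is the documented Hive setting; the master and memory values below are illustrative and need sizing for your own cluster):
>>>>> 
>>>>> set hive.execution.engine=spark;
>>>>> set spark.master=yarn-client;
>>>>> set spark.executor.memory=2g;
>>>>> set spark.eventLog.enabled=true;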
>>>>>  
>>>>> I have made some comparisons on this set-up and I am sure some of you will find them useful.
>>>>>  
>>>>> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
>>>>> The version of Hive I use is Hive 2.
>>>>> The version of Spark I use as the Hive execution engine is 1.3.1. It works, and frankly Spark 1.3.1 as an execution engine is adequate (until we sort out the Hadoop libraries mismatch).
>>>>>  
>>>>> An example: I am using Hive on the Spark engine to find the min, max, avg and stddev of IDs for a table with 1 billion rows:
>>>>>  
>>>>> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>>>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>>  
>>>>>  
>>>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>>>>  
>>>>> INFO  : Completed compiling command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 1.911 seconds
>>>>> INFO  : Executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>> INFO  : Total jobs = 1
>>>>> INFO  : Launching Job 1 out of 1
>>>>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>>>>  
>>>>> Query Hive on Spark job[0] stages:
>>>>> 0
>>>>> 1
>>>>> Status: Running (Hive on Spark job[0])
>>>>> Job Progress Format
>>>>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>>>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>>>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
>>>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1 Finished
>>>>> Status: Finished successfully in 53.25 seconds
>>>>> OK
>>>>> INFO  : Completed executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 56.337 seconds
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> | c0  |     c1     |      c2       |          c3           |
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> 1 row selected (58.529 seconds)
>>>>>  
>>>>> 58 seconds for the first run with a cold cache is pretty good.
>>>>>  
>>>>> And let us compare it with running the same query on the map-reduce engine:
>>>>>  
>>>>> : jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
>>>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>> No rows affected (0.007 seconds)
>>>>> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>>> Total jobs = 1
>>>>> Launching Job 1 out of 1
>>>>> Number of reduce tasks determined at compile time: 1
>>>>> In order to change the average load for a reducer (in bytes):
>>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>> In order to limit the maximum number of reducers:
>>>>>   set hive.exec.reducers.max=<number>
>>>>> In order to set a constant number of reducers:
>>>>>   set mapreduce.job.reduces=<number>
>>>>> Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill job_1463956731753_0005
>>>>> Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
>>>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
>>>>> INFO  : Compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>> INFO  : Semantic Analysis Completed
>>>>> INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, type:double, comment:null)], properties:null)
>>>>> INFO  : Completed compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 0.144 seconds
>>>>> INFO  : Executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>>>> WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
>>>>> INFO  : number of splits:22
>>>>> INFO  : Submitting tokens for job: job_1463956731753_0005
>>>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 4.56 sec
>>>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 9.17 sec
>>>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 14.04 sec
>>>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 18.64 sec
>>>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 23.25 sec
>>>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 27.84 sec
>>>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%, Cumulative CPU 32.56 sec
>>>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU 37.1 sec
>>>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%, Cumulative CPU 41.74 sec
>>>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 46.32 sec
>>>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 50.93 sec
>>>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 55.55 sec
>>>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%, Cumulative CPU 60.25 sec
>>>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 64.86 sec
>>>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 69.41 sec
>>>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 74.06 sec
>>>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 78.72 sec
>>>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 83.32 sec
>>>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 87.9 sec
>>>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 92.52 sec
>>>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%, Cumulative CPU 97.35 sec
>>>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 99.6 sec
>>>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 101.4 sec
>>>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>>> Ended Job = job_1463956731753_0005
>>>>> MapReduce Jobs Launched:
>>>>> Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>>>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>>> OK
>>>>> INFO  : Completed executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 142.525 seconds
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> | c0  |     c1     |      c2       |          c3           |
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> 1 row selected (142.744 seconds)
>>>>>  
>>>>> OK: Hive on the map-reduce engine took 142 seconds compared to 58 seconds with Hive on Spark. So you can obviously gain quite a lot by using Hive on Spark.
>>>>>  
>>>>> Please also note that I did not use any vendor's build for this purpose.
I compiled Spark 1.3.1 myself.
>>>>>  
>>>>> HTH
>>>>>  
>>>>>  
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>  
>>>>> http://talebzadehmich.wordpress.com/
>>>>>  
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Best Regards,
>>>> Ayan Guha
>>>> 
>>>> 
>>> 
>> 
> 
> 

