hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Using Spark on Hive with Hive also using Spark as its execution engine
Date Mon, 23 May 2016 18:00:58 GMT
Have a look at this thread

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 23 May 2016 at 09:10, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:

> Hi Timur and everyone.
>
> I will answer your first question as it is very relevant
>
> 1) How to make 2 versions of Spark live together on the same cluster
> (libraries clash, paths, etc.) ?
> Most of the Spark users perform ETL, ML operations on Spark as well. So,
> we may have 3 Spark installations simultaneously
>
> There are two distinct points here.
>
> Using Spark as a  query engine. That is BAU and most forum members use it
> everyday. You run Spark with either Standalone, Yarn or Mesos as Cluster
> managers. You start master that does the management of resources and you
> start slaves to create workers.
>
>  You deploy Spark either by Spark-shell, Spark-sql or submit jobs through
> spark-submit etc. You may or may not use Hive as your database. You may use
> Hbase via Phoenix etc
> If you choose to use Hive as your database, on every host of cluster
> including your master host, you ensure that Hive APIs are installed
> (meaning Hive installed). In $SPARK_HOME/conf, you create a soft link to
> cd $SPARK_HOME/conf
> hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ltr hive-site.xml
> lrwxrwxrwx 1 hduser hadoop 32 May  3 17:48 *hive-site.xml ->
> /usr/lib/hive/conf/hive-site.xml*
> Now in hive-site.xml you can define all the parameters needed for Spark
> connectivity. Remember we are making Hive use spark1.3.1  engine. WE ARE
> NOT RUNNING SPARK 1.3.1 AS A QUERY TOOL. We do not need to start master or
> workers for Spark 1.3.1! It is just an execution engine like mr etc.
>
> Let us look at how we do that in hive-site,xml. Noting the settings for
> hive.execution.engine=spark and spark.home=/usr/lib/spark-1.3.1-bin-hadoop2
> below. That tells Hive to use spark 1.3.1 as the execution engine. You just
> install spark 1.3.1 on the host just the binary download it is
> /usr/lib/spark-1.3.1-bin-hadoop2.6
>
> In hive-site.xml, you set the properties.
>
>   <property>
>     <name>hive.execution.engine</name>
>     <value>spark</value>
>     <description>
>       Expects one of [mr, tez, spark].
>       Chooses execution engine. Options are: mr (Map reduce, default),
> tez, spark. While MR
>       remains the default engine for historical reasons, it is itself a
> historical engine
>       and is deprecated in Hive 2 line. It may be removed without further
> warning.
>     </description>
>   </property>
>
>   <property>
>     <name>spark.home</name>
>     <value>/usr/lib/spark-1.3.1-bin-hadoop2</value>
>     <description>something</description>
>   </property>
>
>  <property>
>     <name>hive.merge.sparkfiles</name>
>     <value>false</value>
>     <description>Merge small files at the end of a Spark DAG
> Transformation</description>
>   </property>
>
>  <property>
>     <name>hive.spark.client.future.timeout</name>
>     <value>60s</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec,
> ms/msec, us/usec, ns/nsec), which is sec if not specified.
>       Timeout for requests from Hive client to remote Spark driver.
>     </description>
>  </property>
>  <property>
>     <name>hive.spark.job.monitor.timeout</name>
>     <value>60s</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec,
> ms/msec, us/usec, ns/nsec), which is sec if not specified.
>       Timeout for job monitor to get Spark job state.
>     </description>
>  </property>
>
>   <property>
>     <name>hive.spark.client.connect.timeout</name>
>     <value>1000ms</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec,
> ms/msec, us/usec, ns/nsec), which is msec if not specified.
>       Timeout for remote Spark driver in connecting back to Hive client.
>     </description>
>   </property>
>
>   <property>
>     <name>hive.spark.client.server.connect.timeout</name>
>     <value>90000ms</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec,
> ms/msec, us/usec, ns/nsec), which is msec if not specified.
>       Timeout for handshake between Hive client and remote Spark driver.
> Checked by both processes.
>     </description>
>   </property>
>   <property>
>     <name>hive.spark.client.secret.bits</name>
>     <value>256</value>
>     <description>Number of bits of randomness in the generated secret for
> communication between Hive client and remote Spark driver. Rounded down to
> the nearest multiple of 8.</description>
>   </property>
>   <property>
>     <name>hive.spark.client.rpc.threads</name>
>     <value>8</value>
>     <description>Maximum number of threads for remote Spark driver's RPC
> event loop.</description>
>   </property>
>
> And other settings as well
>
> That was the Hive stuff for your Spark BAU. So there are two distinct
> things. Now going to Hive itself, you will need to add the correct assembly
> jar file for Hadoop. These are called
>
> spark-assembly-x.y.z-hadoop2.4.0.jar
>
> Where x.y.z in this case is 1.3.1
>
> The assembly file is
>
> spark-assembly-1.3.1-hadoop2.4.0.jar
>
> So you add that spark-assembly-1.3.1-hadoop2.4.0.jar to $HIVE_HOME/libs
>
> ls $HIVE_HOME/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
> /usr/lib/hive/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
>
> And you need to compile spark from source excluding Hadoop dependencies
>
>
> ./make-distribution.sh --name "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
>
>
> So Hive uses spark engine by default
>
> If you want to use mr in hive you just do
>
> 0: jdbc:hive2://rhes564:10010/default>
> *set hive.execution.engine=mr;*Hive-on-MR is deprecated in Hive 2 and may
> not be available in the future versions. Consider using a different
> execution engine (i.e. spark, tez) or using Hive 1.X releases.
> No rows affected (0.007 seconds)
>
> With regard to the second question
>
> *2) How stable such construction is on INSERT / UPDATE / CTAS operations?
> Any problems with writing into specific tables / directories, ORC / Parquet
> peculiarities, memory / timeout parameters tuning ?*
> With this set up that is Hive using spark as execution engine, my tests
> look OK. Basically I can do whatever I do with Hive using map-reduceengine.
> The caveat as usual is the amount of memory used by Spark for in-memory
> work. I am afraid that resource constraint will be there no matter how you
> want to deploy Spark
>
> *3)) How stable such construction is in multi-user / multi-tenant
> production environment when several people make different queries
> simultaneously?*
>
> This is subjective how you are going to deploy it and how scalable it is.
> Your mileage varies and you really need to test it for yourself to find
> out.
> Also worth noting that with Spark app using Hive ORC tables you may have
> issues with ORC tables defined ass transactional. You do not have that
> issue with Hive on Spark engine. There are certainly limitations with
> HiveSql construct. For example some properties are not implemented. Case in
> point with Spark-sql
>
> spark-sql> CREATE TEMPORARY TABLE tmp as select * from oraclehadoop.sales
> limit 10;
> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
> .
> You are likely trying to use an unsupported Hive feature.";
> However, no issue with hive on spark engine
>
> set hive.execution.engine=spark;
> 0: jdbc:hive2://rhes564:10010/default> CREATE TEMPORARY TABLE tmp as
> select * from oraclehadoop.sales limit 10;
> Starting Spark Job = d87e6c68-03f1-4c37-a9d4-f77e117039a4
> Query Hive on Spark job[0] stages:
> INFO  : Completed executing
> command(queryId=hduser_20160523090757_a474efb8-cea8-473e-8899-60bc7934a887);
> Time taken: 43.894 seconds
> INFO  : OK
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 23 May 2016 at 05:57, Mohanraj Ragupathiraj <mohanaugust@gmail.com>
> wrote:
>
>> Great Comparison !! thanks
>>
>> On Mon, May 23, 2016 at 7:42 AM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> I have done a number of extensive tests using Spark-shell with Hive DB
>>> and ORC tables.
>>>
>>>
>>>
>>> Now one issue that we typically face is and I quote:
>>>
>>>
>>>
>>> Spark is fast as it uses Memory and DAG. Great but when we save data it
>>> is not fast enough
>>>
>>> OK but there is a solution now. If you use Spark with Hive and you are
>>> on a descent version of Hive >= 0.14, then you can also deploy Spark as
>>> execution engine for Hive. That will make your application run pretty fast
>>> as you no longer rely on the old Map-Reduce for Hive engine. In a nutshell
>>> what you are gaining speed in both querying and storage.
>>>
>>>
>>>
>>> I have made some comparisons on this set-up and I am sure some of you
>>> will find it useful.
>>>
>>>
>>>
>>> The version of Spark I use for Spark queries (Spark as query tool) is
>>> 1.6.
>>>
>>> The version of Hive I use in Hive 2
>>>
>>> The version of Spark I use as Hive execution engine is 1.3.1 It works
>>> and frankly Spark 1.3.1 as an execution engine is adequate (until we sort
>>> out the Hadoop libraries mismatch).
>>>
>>>
>>>
>>> An example I am using Hive on Spark engine to find the min and max of
>>> IDs for a table with 1 billion rows:
>>>
>>>
>>>
>>> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id),
>>> stddev(id) from oraclehadoop.dummy;
>>>
>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>
>>>
>>>
>>>
>>>
>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>>
>>>
>>>
>>> INFO  : Completed compiling
>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
>>> Time taken: 1.911 seconds
>>>
>>> INFO  : Executing
>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006):
>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>
>>> INFO  : Query ID =
>>> hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>
>>> INFO  : Total jobs = 1
>>>
>>> INFO  : Launching Job 1 out of 1
>>>
>>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>>
>>>
>>>
>>> Query Hive on Spark job[0] stages:
>>>
>>> 0
>>>
>>> 1
>>>
>>> Status: Running (Hive on Spark job[0])
>>>
>>> Job Progress Format
>>>
>>> CurrentTime StageId_StageAttemptId:
>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>>> [StageCost]
>>>
>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>
>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>
>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>
>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>>>
>>> INFO  :
>>>
>>> Query Hive on Spark job[0] stages:
>>>
>>> INFO  : 0
>>>
>>> INFO  : 1
>>>
>>> INFO  :
>>>
>>> Status: Running (Hive on Spark job[0])
>>>
>>> INFO  : Job Progress Format
>>>
>>> CurrentTime StageId_StageAttemptId:
>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>>> [StageCost]
>>>
>>> INFO  : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>
>>> INFO  : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>
>>> INFO  : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>
>>> INFO  : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>>>
>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0:
>>> 0(+1)/1
>>>
>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1
>>> Finished
>>>
>>> Status: Finished successfully in 53.25 seconds
>>>
>>> OK
>>>
>>> INFO  : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished
>>> Stage-1_0: 0(+1)/1
>>>
>>> INFO  : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished
>>> Stage-1_0: 1/1 Finished
>>>
>>> INFO  : Status: Finished successfully in 53.25 seconds
>>>
>>> INFO  : Completed executing
>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
>>> Time taken: 56.337 seconds
>>>
>>> INFO  : OK
>>>
>>> +-----+------------+---------------+-----------------------+--+
>>>
>>> | c0  |     c1     |      c2       |          c3           |
>>>
>>> +-----+------------+---------------+-----------------------+--+
>>>
>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>
>>> +-----+------------+---------------+-----------------------+--+
>>>
>>> 1 row selected (58.529 seconds)
>>>
>>>
>>>
>>> 58 seconds first run with cold cache is pretty good
>>>
>>>
>>>
>>> And let us compare it with running the same query on map-reduce engine
>>>
>>>
>>>
>>> : jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
>>>
>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the
>>> future versions. Consider using a different execution engine (i.e. spark,
>>> tez) or using Hive 1.X releases.
>>>
>>> No rows affected (0.007 seconds)
>>>
>>> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id),
>>> stddev(id) from oraclehadoop.dummy;
>>>
>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in
>>> the future versions. Consider using a different execution engine (i.e.
>>> spark, tez) or using Hive 1.X releases.
>>>
>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>
>>> Total jobs = 1
>>>
>>> Launching Job 1 out of 1
>>>
>>> Number of reduce tasks determined at compile time: 1
>>>
>>> In order to change the average load for a reducer (in bytes):
>>>
>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>
>>> In order to limit the maximum number of reducers:
>>>
>>>   set hive.exec.reducers.max=<number>
>>>
>>> In order to set a constant number of reducers:
>>>
>>>   set mapreduce.job.reduces=<number>
>>>
>>> Starting Job = job_1463956731753_0005, Tracking URL =
>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>
>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
>>> job_1463956731753_0005
>>>
>>> Hadoop job information for Stage-1: number of mappers: 22; number of
>>> reducers: 1
>>>
>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
>>>
>>> INFO  : Compiling
>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc):
>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>
>>> INFO  : Semantic Analysis Completed
>>>
>>> INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0,
>>> type:int, comment:null), FieldSchema(name:c1, type:int, comment:null),
>>> FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3,
>>> type:double, comment:null)], properties:null)
>>>
>>> INFO  : Completed compiling
>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc);
>>> Time taken: 0.144 seconds
>>>
>>> INFO  : Executing
>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc):
>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>
>>> WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in
>>> the future versions. Consider using a different execution engine (i.e.
>>> spark, tez) or using Hive 1.X releases.
>>>
>>> INFO  : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be
>>> available in the future versions. Consider using a different execution
>>> engine (i.e. spark, tez) or using Hive 1.X releases.
>>>
>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in
>>> the future versions. Consider using a different execution engine (i.e.
>>> spark, tez) or using Hive 1.X releases.
>>>
>>> INFO  : Query ID =
>>> hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>
>>> INFO  : Total jobs = 1
>>>
>>> INFO  : Launching Job 1 out of 1
>>>
>>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>>
>>> INFO  : Number of reduce tasks determined at compile time: 1
>>>
>>> INFO  : In order to change the average load for a reducer (in bytes):
>>>
>>> INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
>>>
>>> INFO  : In order to limit the maximum number of reducers:
>>>
>>> INFO  :   set hive.exec.reducers.max=<number>
>>>
>>> INFO  : In order to set a constant number of reducers:
>>>
>>> INFO  :   set mapreduce.job.reduces=<number>
>>>
>>> WARN  : Hadoop command-line option parsing not performed. Implement the
>>> Tool interface and execute your application with ToolRunner to remedy this.
>>>
>>> INFO  : number of splits:22
>>>
>>> INFO  : Submitting tokens for job: job_1463956731753_0005
>>>
>>> INFO  : The url to track the job:
>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>
>>> INFO  : Starting Job = job_1463956731753_0005, Tracking URL =
>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>
>>> INFO  : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
>>> job_1463956731753_0005
>>>
>>> INFO  : Hadoop job information for Stage-1: number of mappers: 22;
>>> number of reducers: 1
>>>
>>> INFO  : 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
>>>
>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU
>>> 4.56 sec
>>>
>>> INFO  : 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%,
>>> Cumulative CPU 4.56 sec
>>>
>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU
>>> 9.17 sec
>>>
>>> INFO  : 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%,
>>> Cumulative CPU 9.17 sec
>>>
>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU
>>> 14.04 sec
>>>
>>> INFO  : 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%,
>>> Cumulative CPU 14.04 sec
>>>
>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU
>>> 18.64 sec
>>>
>>> INFO  : 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%,
>>> Cumulative CPU 18.64 sec
>>>
>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU
>>> 23.25 sec
>>>
>>> INFO  : 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%,
>>> Cumulative CPU 23.25 sec
>>>
>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU
>>> 27.84 sec
>>>
>>> INFO  : 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%,
>>> Cumulative CPU 27.84 sec
>>>
>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%, Cumulative CPU
>>> 32.56 sec
>>>
>>> INFO  : 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%,
>>> Cumulative CPU 32.56 sec
>>>
>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU
>>> 37.1 sec
>>>
>>> INFO  : 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%,
>>> Cumulative CPU 37.1 sec
>>>
>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%, Cumulative CPU
>>> 41.74 sec
>>>
>>> INFO  : 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%,
>>> Cumulative CPU 41.74 sec
>>>
>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU
>>> 46.32 sec
>>>
>>> INFO  : 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%,
>>> Cumulative CPU 46.32 sec
>>>
>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU
>>> 50.93 sec
>>>
>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU
>>> 55.55 sec
>>>
>>> INFO  : 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%,
>>> Cumulative CPU 50.93 sec
>>>
>>> INFO  : 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%,
>>> Cumulative CPU 55.55 sec
>>>
>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%, Cumulative CPU
>>> 60.25 sec
>>>
>>> INFO  : 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%,
>>> Cumulative CPU 60.25 sec
>>>
>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU
>>> 64.86 sec
>>>
>>> INFO  : 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%,
>>> Cumulative CPU 64.86 sec
>>>
>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU
>>> 69.41 sec
>>>
>>> INFO  : 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%,
>>> Cumulative CPU 69.41 sec
>>>
>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU
>>> 74.06 sec
>>>
>>> INFO  : 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%,
>>> Cumulative CPU 74.06 sec
>>>
>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU
>>> 78.72 sec
>>>
>>> INFO  : 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%,
>>> Cumulative CPU 78.72 sec
>>>
>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU
>>> 83.32 sec
>>>
>>> INFO  : 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%,
>>> Cumulative CPU 83.32 sec
>>>
>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU
>>> 87.9 sec
>>>
>>> INFO  : 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%,
>>> Cumulative CPU 87.9 sec
>>>
>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU
>>> 92.52 sec
>>>
>>> INFO  : 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%,
>>> Cumulative CPU 92.52 sec
>>>
>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%, Cumulative CPU
>>> 97.35 sec
>>>
>>> INFO  : 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%,
>>> Cumulative CPU 97.35 sec
>>>
>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
>>> 99.6 sec
>>>
>>> INFO  : 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%,
>>> Cumulative CPU 99.6 sec
>>>
>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%, Cumulative
>>> CPU 101.4 sec
>>>
>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>
>>> Ended Job = job_1463956731753_0005
>>>
>>> MapReduce Jobs Launched:
>>>
>>> Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS
>>> Read: 5318569 HDFS Write: 46 SUCCESS
>>>
>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>
>>> OK
>>>
>>> INFO  : 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%,
>>> Cumulative CPU 101.4 sec
>>>
>>> INFO  : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400
>>> msec
>>>
>>> INFO  : Ended Job = job_1463956731753_0005
>>>
>>> INFO  : MapReduce Jobs Launched:
>>>
>>> INFO  : Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec
>>> HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>>>
>>> INFO  : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>
>>> INFO  : Completed executing
>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc);
>>> Time taken: 142.525 seconds
>>>
>>> INFO  : OK
>>>
>>> +-----+------------+---------------+-----------------------+--+
>>>
>>> | c0  |     c1     |      c2       |          c3           |
>>>
>>> +-----+------------+---------------+-----------------------+--+
>>>
>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>
>>> +-----+------------+---------------+-----------------------+--+
>>>
>>> 1 row selected (142.744 seconds)
>>>
>>>
>>>
>>> OK Hive on map-reduce engine took 142 seconds compared to 58 seconds
>>> with Hive on Spark. So you can obviously gain pretty well by using Hive on
>>> Spark.
>>>
>>>
>>>
>>> Please also note that I did not use any vendor's build for this purpose.
>>> I compiled Spark 1.3.1 myself.
>>>
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com/
>>>
>>>
>>>
>>
>>
>>
>> --
>> Thanks and Regards
>> Mohan
>> VISA Pte Limited, Singapore.
>>
>
>

Mime
View raw message