From: Mich Talebzadeh <mich.talebzadeh@gmail.com>
Date: Tue, 12 Jul 2016 14:39:34 +0100
Subject: Re: Using Spark on Hive with Hive also using Spark as its execution engine
To: user@hive.apache.org

thanks Marcin.

What is your guesstimate on the order of "faster" please?

Cheers

Dr Mich Talebzadeh

LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On 12 July 2016 at 14:35, Marcin Tustin wrote:

> Quick note - my experience (no benchmarks) is that Tez without LLAP (we're
> still not on Hive 2) is faster than MR by some way. I haven't dug into why
> that might be.
>
> On Tue, Jul 12, 2016 at 9:19 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>
>> Sorry, I completely missed your points.
>>
>> I was NOT talking about Exadata. I was comparing Oracle 12c caching with
>> that of Oracle TimesTen. No one mentioned Exadata here, nor storage
>> indexes etc.
>>
>> So if Tez is not MR with DAG, could you give me an example of how it
>> works? No opinions, but relevant to this point. I do not know much about
>> Tez, as I stated before.
>>
>> Case in point: if Tez could do the job on its own, why is Tez used in
>> conjunction with LLAP, as Marcin alluded to as well in this thread?
>>
>> Having said that, I would be interested if you provided a working example
>> of Hive on Tez compared to Hive on MR.
>>
>> One experiment is worth hundreds of opinions.
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *
>>
>> http://talebzadehmich.wordpress.com
>>
>> On 12 July 2016 at 13:31, Jörn Franke wrote:
>>
>>> I think the comparison with the Oracle RDBMS and Oracle TimesTen is not
>>> so good. There are times when the in-memory database of Oracle is slower
>>> than the RDBMS (especially in the case of Exadata), because in-memory -
>>> as in Spark - means everything is in memory and everything is always
>>> processed (no storage indexes, no bloom filters etc.), which explains
>>> this behavior quite well.
>>>
>>> Hence, I do not agree with the statement that Tez is basically MR with
>>> DAG (or that LLAP is basically in-memory, which is also not correct).
>>> This is a wrong oversimplification and I do not think it is useful for
>>> the community; better is to understand when something can be used and
>>> when not. In-memory is also not the solution to everything, and if you
>>> look for example behind SAP HANA or NoSQL there is much more around this,
>>> which is not even on the roadmap of Spark.
>>>
>>> Anyway, discovering good use case patterns should be done on standardized
>>> benchmarks going beyond the select count etc.
>>>
>>> On 12 Jul 2016, at 11:16, Mich Talebzadeh wrote:
>>>
>>> That is only a plan, not what the execution engine is doing.
>>>
>>> As I stated before, Spark uses DAG + in-memory computing. MR is serial
>>> on disk.
>>>
>>> The key is the execution here, or rather the execution engine.
>>>
>>> In general:
>>>
>>> The standard MapReduce, as I know it, reads the data from HDFS, applies
>>> the map-reduce algorithm and writes back to HDFS. If there are many
>>> iterations of map-reduce then there will be many intermediate writes to
>>> HDFS. This is all serial writes to disk. Each map-reduce step is
>>> completely independent of other steps, and the executing engine does not
>>> have any global knowledge of what map-reduce steps are going to come
>>> after each map-reduce step. For many iterative algorithms this is
>>> inefficient, as the data between each map-reduce pair gets written to and
>>> read from the file system.
>>>
>>> The equivalent of parallelism in Big Data is deploying what is known as a
>>> Directed Acyclic Graph (DAG) algorithm. In a nutshell, deploying a DAG
>>> results in a fuller picture of global optimisation by deploying
>>> parallelism, pipelining consecutive map steps into one and not writing
>>> intermediate data to HDFS. So in short this prevents writing data back
>>> and forth after every reduce step, which for me is a significant
>>> improvement compared to the classical MapReduce algorithm.
>>>
>>> Now Tez is basically MR with DAG. With Spark you get DAG + in-memory
>>> computing. Think of it as a comparison between a classic RDBMS like
>>> Oracle and an IMDB like Oracle TimesTen with in-memory processing.
>>>
>>> The outcome is that Hive using Spark as its execution engine is pretty
>>> impressive. You have the advantage of Hive's CBO + in-memory computing.
>>> If you use Spark for all this (say Spark SQL) but no Hive, Spark uses its
>>> own optimizer, called Catalyst, which does not have a CBO yet, plus
>>> in-memory computing.
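>>>
>>> As a rough illustration of the multi-stage point (the visits table and
>>> its columns below are only illustrative, they are not part of my test):
>>> a query like this needs several shuffle stages. Under classic MR each
>>> stage is a separate job whose intermediate output is written to HDFS,
>>> whereas a DAG engine (Tez or Spark) pipelines the stages without that
>>> HDFS round trip.
>>>
>>> select country, count(*) as cnt
>>> from (select distinct user_id, country from visits) t  -- stage 1: shuffle for the distinct
>>> group by country                                       -- stage 2: shuffle for the aggregation
>>> order by cnt desc;                                      -- stage 3: final single-reducer sort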
>>>
>>> As usual your mileage varies.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> On 12 July 2016 at 09:33, Markovitz, Dudu wrote:
>>>
>>>> I don't see how this explains the time differences.
>>>>
>>>> Dudu
>>>>
>>>> *From:* Mich Talebzadeh [mailto:mich.talebzadeh@gmail.com]
>>>> *Sent:* Tuesday, July 12, 2016 10:56 AM
>>>> *To:* user
>>>> *Cc:* user @spark
>>>> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its execution engine
>>>>
>>>> This is the whole idea. Spark uses DAG + IM (in-memory); MR is classic.
>>>>
>>>> This is for Hive on Spark:
>>>>
>>>> hive> explain select max(id) from dummy_parquet;
>>>> OK
>>>> STAGE DEPENDENCIES:
>>>>   Stage-1 is a root stage
>>>>   Stage-0 depends on stages: Stage-1
>>>>
>>>> STAGE PLANS:
>>>>   Stage: Stage-1
>>>>     Spark
>>>>       Edges:
>>>>         Reducer 2 <- Map 1 (GROUP, 1)
>>>>       DagName: hduser_20160712083219_632c2749-7387-478f-972d-9eaadd9932c6:1
>>>>       Vertices:
>>>>         Map 1
>>>>             Map Operator Tree:
>>>>                 TableScan
>>>>                   alias: dummy_parquet
>>>>                   Statistics: Num rows: 100000000 Data size: 700000000 Basic stats: COMPLETE Column stats: NONE
>>>>                   Select Operator
>>>>                     expressions: id (type: int)
>>>>                     outputColumnNames: id
>>>>                     Statistics: Num rows: 100000000 Data size: 700000000 Basic stats: COMPLETE Column stats: NONE
>>>>                     Group By Operator
>>>>                       aggregations: max(id)
>>>>                       mode: hash
>>>>                       outputColumnNames: _col0
>>>>                       Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>>>>                       Reduce Output Operator
>>>>                         sort order:
>>>>                         Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>>>>                         value expressions: _col0 (type: int)
>>>>         Reducer 2
>>>>             Reduce Operator Tree:
>>>>               Group By Operator
>>>>                 aggregations: max(VALUE._col0)
>>>>                 mode: mergepartial
>>>>                 outputColumnNames: _col0
>>>>                 Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>>>>                 File Output Operator
>>>>                   compressed: false
>>>>                   Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>>>>                   table:
>>>>                       input format: org.apache.hadoop.mapred.TextInputFormat
>>>>                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>>                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>>
>>>>   Stage: Stage-0
>>>>     Fetch Operator
>>>>       limit: -1
>>>>       Processor Tree:
>>>>         ListSink
>>>>
>>>> Time taken: 2.801 seconds, Fetched: 50 row(s)
>>>>
>>>> And this is with the execution engine set to MR:
>>>>
>>>> hive> set hive.execution.engine=mr;
>>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>
>>>> hive> explain select max(id) from dummy_parquet;
>>>> OK
>>>> STAGE DEPENDENCIES:
>>>>   Stage-1 is a root stage
>>>>   Stage-0 depends on stages: Stage-1
>>>>
>>>> STAGE PLANS:
>>>>   Stage: Stage-1
>>>>     Map Reduce
>>>>       Map Operator Tree:
>>>>           TableScan
>>>>             alias: dummy_parquet
>>>>             Statistics: Num rows: 100000000 Data size: 700000000 Basic stats: COMPLETE Column stats: NONE
>>>>             Select Operator
>>>>               expressions: id (type: int)
>>>>               outputColumnNames: id
>>>>               Statistics: Num rows: 100000000 Data size: 700000000 Basic stats: COMPLETE Column stats: NONE
>>>>               Group By Operator
>>>>                 aggregations: max(id)
>>>>                 mode: hash
>>>>                 outputColumnNames: _col0
>>>>                 Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>>>>                 Reduce Output Operator
>>>>                   sort order:
>>>>                   Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>>>>                   value expressions: _col0 (type: int)
>>>>       Reduce Operator Tree:
>>>>         Group By Operator
>>>>           aggregations: max(VALUE._col0)
>>>>           mode: mergepartial
>>>>           outputColumnNames: _col0
>>>>           Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>>>>           File Output Operator
>>>>             compressed: false
>>>>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
>>>>             table:
>>>>                 input format: org.apache.hadoop.mapred.TextInputFormat
>>>>                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>>                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>>
>>>>   Stage: Stage-0
>>>>     Fetch Operator
>>>>       limit: -1
>>>>       Processor Tree:
>>>>         ListSink
>>>>
>>>> Time taken: 0.1 seconds, Fetched: 44 row(s)
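>>>>
>>>> (A side note: both plans above report "Column stats: NONE". If column
>>>> statistics for the CBO were wanted, they could be gathered beforehand
>>>> with something along these lines, using the same dummy_parquet table;
>>>> this is illustrative rather than part of the runs shown here.)
>>>>
>>>> hive> analyze table dummy_parquet compute statistics;
>>>> hive> analyze table dummy_parquet compute statistics for columns id;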
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> On 12 July 2016 at 08:16, Markovitz, Dudu wrote:
>>>>
>>>> This is a simple task –
>>>>
>>>> Read the files, find the local max value and combine the results (find
>>>> the global max value).
>>>>
>>>> How do you explain the differences in the results? Spark reads the files
>>>> and finds a local max 10X (+) faster than MR?
>>>>
>>>> Can you please attach the execution plan?
>>>>
>>>> Thanks
>>>>
>>>> Dudu
>>>>
>>>> *From:* Mich Talebzadeh [mailto:mich.talebzadeh@gmail.com]
>>>> *Sent:* Monday, July 11, 2016 11:55 PM
>>>> *To:* user ; user @spark
>>>> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its execution engine
>>>>
>>>> In my test I did like for like, keeping the set-up the same, namely:
>>>>
>>>>    1. The table was a Parquet table of 100 million rows
>>>>    2. The same set-up was used for both Hive on Spark and Hive on MR
>>>>    3. Spark was very impressive compared to MR on this particular test
>>>>
>>>> Just to see any issues, I created an ORC table in the image of the
>>>> Parquet one (insert/select from Parquet to ORC) with stats updated for
>>>> columns etc.
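>>>>
>>>> (The actual DDL for that copy is not shown here; an assumed form would
>>>> be something like the following, with oraclehadoop.dummy as the ORC
>>>> target queried below and dummy_parquet as the Parquet source.)
>>>>
>>>> hive> create table oraclehadoop.dummy stored as orc as select * from dummy_parquet;
>>>> hive> analyze table oraclehadoop.dummy compute statistics for columns;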
>>>>
>>>> These were the results of the same run using the ORC table this time:
>>>>
>>>> hive> select max(id) from oraclehadoop.dummy;
>>>>
>>>> Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
>>>>
>>>> Query Hive on Spark job[1] stages:
>>>> 2
>>>> 3
>>>>
>>>> Status: Running (Hive on Spark job[1])
>>>> Job Progress Format
>>>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>>>> 2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23     Stage-3_0: 0/1
>>>> 2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23     Stage-3_0: 0/1
>>>> 2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23     Stage-3_0: 0/1
>>>> 2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23     Stage-3_0: 0/1
>>>> 2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23     Stage-3_0: 0/1
>>>> 2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23    Stage-3_0: 0/1
>>>> 2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23    Stage-3_0: 0/1
>>>> 2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23    Stage-3_0: 0/1
>>>> 2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23    Stage-3_0: 0/1
>>>> 2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23    Stage-3_0: 0/1
>>>> 2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished       Stage-3_0: 0(+1)/1
>>>> 2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished       Stage-3_0: 1/1 Finished
>>>> Status: Finished successfully in 16.08 seconds
>>>> OK
>>>> 100000000
>>>> Time taken: 17.775 seconds, Fetched: 1 row(s)
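>>>>
>>>> (For reference: a Hive on Spark run like the one above presumes the
>>>> session is already configured for the Spark engine, along these lines;
>>>> the exact values here are only illustrative, not the configuration used
>>>> for this test.)
>>>>
>>>> hive> set hive.execution.engine=spark;
>>>> hive> set spark.master=yarn-client;
>>>> hive> set spark.executor.memory=4g;
>>>> hive> set spark.executor.instances=8;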
>>>>
>>>> Repeat with the MR engine:
>>>>
>>>> hive> set hive.execution.engine=mr;
>>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>
>>>> hive> select max(id) from oraclehadoop.dummy;
>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>> Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
>>>> Total jobs = 1
>>>> Launching Job 1 out of 1
>>>> Number of reduce tasks determined at compile time: 1
>>>> In order to change the average load for a reducer (in bytes):
>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>> In order to limit the maximum number of reducers:
>>>>   set hive.exec.reducers.max=<number>
>>>> In order to set a constant number of reducers:
>>>>   set mapreduce.job.reduces=<number>
>>>> Starting Job = job_1468226887011_0008, Tracking URL = http://rhes564:8088/proxy/application_1468226887011_0008/
>>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill job_1468226887011_0008
>>>> Hadoop job information for Stage-1: number of mappers: 23; number of reducers: 1
>>>> 2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
>>>> 2016-07-11 21:37:06,440 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU 16.48 sec
>>>> 2016-07-11 21:37:14,751 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 40.63 sec
>>>> 2016-07-11 21:37:22,048 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 58.88 sec
>>>> 2016-07-11 21:37:30,412 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU 80.72 sec
>>>> 2016-07-11 21:37:37,707 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU 103.43 sec
>>>> 2016-07-11 21:37:45,999 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU 125.93 sec
>>>> 2016-07-11 21:37:54,300 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU 147.17 sec
>>>> 2016-07-11 21:38:01,538 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU 166.56 sec
>>>> 2016-07-11 21:38:08,807 Stage-1 map = 39%,  reduce = 0%, Cumulative CPU 189.29 sec
>>>> 2016-07-11 21:38:17,115 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU 211.03 sec
>>>> 2016-07-11 21:38:24,363 Stage-1 map = 48%,  reduce = 0%, Cumulative CPU 235.68 sec
>>>> 2016-07-11 21:38:32,638 Stage-1 map = 52%,  reduce = 0%, Cumulative CPU 258.27 sec
>>>> 2016-07-11 21:38:40,916 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 278.44 sec
>>>> 2016-07-11 21:38:49,206 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU 300.35 sec
>>>> 2016-07-11 21:38:58,524 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU 322.89 sec
>>>> 2016-07-11 21:39:07,889 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 344.8 sec
>>>> 2016-07-11 21:39:16,151 Stage-1 map = 74%,  reduce = 0%, Cumulative CPU 367.77 sec
>>>> 2016-07-11 21:39:25,456 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU 391.82 sec
>>>> 2016-07-11 21:39:33,725 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU 415.48 sec
>>>> 2016-07-11 21:39:43,037 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU 436.09 sec
>>>> 2016-07-11 21:39:51,292 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 459.4 sec
>>>> 2016-07-11 21:39:59,563 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU 477.92 sec
>>>> 2016-07-11 21:40:05,760 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 491.72 sec
>>>> 2016-07-11 21:40:10,921 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 499.37 sec
>>>> MapReduce Total cumulative CPU time: 8 minutes 19 seconds 370 msec
>>>> Ended Job = job_1468226887011_0008
>>>> MapReduce Jobs Launched:
>>>> Stage-Stage-1: Map: 23  Reduce: 1   Cumulative CPU: 499.37 sec   HDFS Read: 403754774 HDFS Write: 10 SUCCESS
>>>> Total MapReduce CPU Time Spent: 8 minutes 19 seconds 370 msec
>>>> OK
>>>> 100000000
>>>> Time taken: 202.333 seconds, Fetched: 1 row(s)
>>>>
>>>> So in summary:
>>>>
>>>> Table       MR/sec      Spark/sec
>>>> Parquet     239.532     14.38
>>>> ORC         202.333     17.77
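>>>>
>>>> That is, roughly 240/14.4 ~ 17 times faster on the Parquet table and
>>>> 202/17.8 ~ 11 times faster on the ORC table for this particular query.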
>>>>
>>>> Still I would use Spark if I had a choice, and I agree that on VLTs
>>>> (very large tables) the limitation in available memory may be the
>>>> overriding factor in using Spark.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> On 11 July 2016 at 19:25, Gopal Vijayaraghavan wrote:
>>>>
>>>> > Status: Finished successfully in 14.12 seconds
>>>> > OK
>>>> > 100000000
>>>> > Time taken: 14.38 seconds, Fetched: 1 row(s)
>>>>
>>>> That might be an improvement over MR, but that still feels far too slow.
>>>>
>>>> Parquet numbers are in general bad in Hive, but that's because the
>>>> Parquet reader gets no actual love from the devs. The community, if it
>>>> wants to keep using Parquet heavily, needs a Hive dev to go over to
>>>> parquet-mr and cut a significant number of memory copies out of the
>>>> reader.
>>>>
>>>> The Spark 2.0 build, for instance, has a custom Parquet reader for
>>>> SparkSQL which does this. SPARK-12854 does for Spark+Parquet what Hive
>>>> 2.0 does for ORC (actually, it looks more like Hive's VectorizedRowBatch
>>>> than Tungsten's flat layouts).
>>>>
>>>> But that reader cannot be used in Hive-on-Spark, because it is not a
>>>> public reader impl.
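>>>>
>>>> (For orientation, a hedged sketch of the kind of session settings behind
>>>> an LLAP run like the one below; these lines are illustrative, not the
>>>> actual configuration of this cluster.)
>>>>
>>>> hive> set hive.execution.engine=tez;
>>>> hive> set hive.llap.execution.mode=all;
>>>> hive> select max(l_discount) from lineitem;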
>>>>
>>>> Not to pick an arbitrary dataset, my workhorse example is a TPC-H
>>>> lineitem at 10Gb scale on a single 16-core box.
>>>>
>>>> hive(tpch_flat_orc_10)> select max(l_discount) from lineitem;
>>>> Query ID = gopal_20160711175917_f96371aa-2721-49c8-99a0-f7c4a1eacfda
>>>> Total jobs = 1
>>>> Launching Job 1 out of 1
>>>>
>>>> Status: Running (Executing on YARN cluster with App id application_1466700718395_0256)
>>>>
>>>> ----------------------------------------------------------------------------------------------
>>>>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>>>> ----------------------------------------------------------------------------------------------
>>>> Map 1 ..........      llap     SUCCEEDED     13         13        0        0       0       0
>>>> Reducer 2 ......      llap     SUCCEEDED      1          1        0        0       0       0
>>>> ----------------------------------------------------------------------------------------------
>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 0.71 s
>>>> ----------------------------------------------------------------------------------------------
>>>> Status: DAG finished successfully in 0.71 seconds
>>>>
>>>> Query Execution Summary
>>>> ----------------------------------------------------------------------------------------------
>>>> OPERATION                            DURATION
>>>> ----------------------------------------------------------------------------------------------
>>>> Compile Query                           0.21s
>>>> Prepare Plan                            0.13s
>>>> Submit Plan                             0.34s
>>>> Start DAG                               0.23s
>>>> Run DAG                                 0.71s
>>>> ----------------------------------------------------------------------------------------------
>>>>
>>>> Task Execution Summary
>>>> ----------------------------------------------------------------------------------------------
>>>>   VERTICES   DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
>>>> ----------------------------------------------------------------------------------------------
>>>>      Map 1         604.00             0            0     59,957,438              13
>>>>  Reducer 2         105.00             0            0             13               0
>>>> ----------------------------------------------------------------------------------------------
>>>>
>>>> LLAP IO Summary
>>>> ----------------------------------------------------------------------------------------------
>>>>   VERTICES  ROWGROUPS  META_HIT  META_MISS  DATA_HIT  DATA_MISS  ALLOCATION      USED  TOTAL_IO
>>>> ----------------------------------------------------------------------------------------------
>>>>      Map 1       6036         0        146        0B    68.86MB    491.00MB  479.89MB     7.94s
>>>> ----------------------------------------------------------------------------------------------
>>>>
>>>> OK
>>>> 0.1
>>>> Time taken: 1.669 seconds, Fetched: 1 row(s)
>>>> hive(tpch_flat_orc_10)>
>>>>
>>>> This is running against a single 16-core box & I would assume it would
>>>> take <1.4s to read twice as much (13 tasks is barely touching the load
>>>> factors).
>>>>
>>>> It would probably be a bit faster if the cache had hits, but in general
>>>> 14s to read 100M rows is nearly a magnitude off where Hive 2.2.0 is.
>>>>
>>>> Cheers,
>>>> Gopal