hive-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Hive on TEZ + LLAP
Date Tue, 19 Jul 2016 14:44:36 GMT
Thanks

In this sample query

select  i_brand_id brand_id, i_brand brand,
        sum(ss_ext_sales_price) ext_price
 from   date_dim, store_sales, item
 where  date_dim.d_date_sk = store_sales.ss_sold_date_sk
        and store_sales.ss_item_sk = item.i_item_sk
        and i_manager_id=36
        and d_moy=12
        and d_year=2001
 group by i_brand, i_brand_id
 order by ext_price desc, i_brand_id
 limit 100 ;

What was the storage format (Parquet, text, ORC etc) and row count for each
of the three tables above?
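
For reference, a minimal sketch of how both could be checked from the Hive
CLI or beeline, assuming the three tables live in the current database. The
storage format shows up in the InputFormat / OutputFormat / SerDe Library
rows of the DESCRIBE FORMATTED output.

```sql
-- Storage format: look at InputFormat, OutputFormat and SerDe Library
describe formatted date_dim;
describe formatted store_sales;
describe formatted item;

-- Row counts for the three tables in the query
select count(*) from date_dim;
select count(*) from store_sales;
select count(*) from item;
```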

thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 19 July 2016 at 02:17, Gopal Vijayaraghavan <gopalv@apache.org> wrote:

>
> > These look pretty impressive. What execution mode were you running
> > these in? YARN client, maybe?
>
> There is no other mode - everything runs on YARN.
>
> > 53 times
>
>
> The factor is actually bigger if you look at execution alone.
>
> The MRv2 version takes 2.47s to prep a query, while the LLAP version takes
> 1.64s.
>
> The MRv2 version takes 200.319s to execute the query, while the LLAP
> version takes 1.02s.
>
> The execution factor is nearly 200x, but the compile time becomes
> significant as you scale down the latencies.
>
> > My calculations on Hive 2 on Spark 1.3.1
>
> Not sure where Hive2-on-Spark is going - the last commit to SparkCompiler
> was late last year, before there was a Hive2.
>
> On the speed front, I'm pretty sure you have got most of the Hive2
> optimizations disabled; even the most basic of the Stinger optimizations
> might be missing for you.
>
> Check if you have
>
> set hive.vectorized.execution.enabled=true;
>
>
> Some of these new optimizations don't work on H-o-S, because Hive-on-Spark
> does not implement a true broadcast join - instead it uses a
> SparkHashTableSinkOperator, which actually writes to HDFS instead of
> sending it directly to the downstream task.
>
>
> I don't understand why that is the case instead of RDD broadcast, but that
> prevents the JOIN optimizations which convert the 34 sec query into a 3.8
> sec query from applying to Spark execution.
>
> A couple of examples would be
>
> set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
> set hive.vectorized.execution.mapjoin.minmax.enabled=true;
>
> Those two make easy work of joins in LLAP, particularly semi-joins which
> are common in BI queries.
>
>
> Once LLAP is out of tech preview, we can enable most of them by default
> for Tez+LLAP, but that would not mean all of it applies to
> Hive-on-(Spark/MR).
>
> Getting these new features onto another engine takes active effort from
> the engine's devs.
>
> Cheers,
> Gopal