hive-user mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: Hive on Tez: Tez taking nX more containers than Mapreduce for union all
Date Fri, 17 Mar 2017 15:31:04 GMT

> We are using a query with union all and groupby and same table is read multiple times
in the union all subquery.
…
> When run with Mapreduce, the job is run in one stage consuming n mappers and m reducers
and all union all scans are done with the same job.

The logical plans are identical, by the way. MR effectively reads the same table again and
again too, unless the correlation optimizer is folding the scans together.

I doubt it is folding them here, because of the unix_timestamp() call. An EXPLAIN would be useful.
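
As a rough sketch of what to run: the table and column names below (events, event_type, ts) are hypothetical stand-ins, since I am not copying the pastebin query here.

```sql
-- Hypothetical query shape: the same table scanned in each UNION ALL branch.
EXPLAIN
SELECT event_type, count(*) AS cnt
FROM (
  SELECT event_type FROM events WHERE ts >= unix_timestamp() - 3600
  UNION ALL
  SELECT event_type FROM events WHERE ts >= unix_timestamp() - 86400
) u
GROUP BY event_type;
```

Each TableScan operator in the output corresponds to one read of the table, so running this under both engines shows whether any of the scans are actually being shared.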

> Hence if there are 50 union alls in a query, the 50n map vertex tasks are launched which
is huge.

Tez lets you scale the number of mappers up or down using the split grouping parameters, so
you can tune it down if you want to.

set tez.grouping.split-waves=0.1;

would try to shrink the width (the number of tasks) of each map vertex.
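
A sketch of how the grouping settings might be combined. The byte values are assumptions picked for illustration, not recommendations. As far as I understand it, the grouped split size is still clamped between tez.grouping.min-size and tez.grouping.max-size, so a very low wave factor may have no further effect unless the upper bound is raised as well.

```sql
-- Fewer waves means fewer, larger grouped splits per map vertex.
set tez.grouping.split-waves=0.1;
-- Assumed lower bound on a grouped split, in bytes (256 MB).
set tez.grouping.min-size=268435456;
-- Assumed upper bound on a grouped split, in bytes (2 GB).
set tez.grouping.max-size=2147483648;
```

These are session-level settings, so they can be applied just before the union all query and left at their defaults for everything else.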

An alternative is to use a CTE with materialization (HIVE-11752), but for that you need Hive 2.x.
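
A minimal sketch of the CTE approach, assuming a Hive 2.x release that includes HIVE-11752. The table, columns and filter values (events, event_type, status, dt) are hypothetical. The threshold is set to 1 here so that a CTE referenced twice should qualify for materialization.

```sql
-- Assumption: the default of -1 leaves CTE materialization disabled.
set hive.optimize.cte.materialize.threshold=1;

-- The shared scan is written once as a CTE and referenced in every branch.
WITH base AS (
  SELECT event_type, status
  FROM events
  WHERE dt = '2017-03-17'
)
SELECT bucket, event_type, count(*) AS cnt
FROM (
  SELECT 'ok' AS bucket, event_type FROM base WHERE status = 'ok'
  UNION ALL
  SELECT 'failed' AS bucket, event_type FROM base WHERE status = 'failed'
) u
GROUP BY bucket, event_type;
```

With materialization the base table should be read once into a temporary result, and the union all branches then scan that smaller result instead of the original table.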

> http://pastebin.com/u7Rw6Hag

You can probably get a ~2x speedup by removing the UNIX_TIMESTAMP() and using CURRENT_TIMESTAMP
instead.
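
For illustration, a before/after sketch on a hypothetical events table with an epoch-seconds column ts. It assumes the one-argument form of unix_timestamp accepts a TIMESTAMP value; the actual rewrite depends on how the pastebin query uses the function.

```sql
-- Before: unix_timestamp() with no arguments is non-deterministic,
-- so the optimizer cannot fold it into a constant.
SELECT count(*) FROM events
WHERE ts >= unix_timestamp() - 3600;

-- After: CURRENT_TIMESTAMP returns the same value for the whole query.
SELECT count(*) FROM events
WHERE ts >= unix_timestamp(current_timestamp) - 3600;
```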

Cheers,
Gopal


