hive-user mailing list archives

From Gopal Vijayaraghavan
Subject Re: Hive on Tez: Tez taking nX more containers than Mapreduce for union all
Date Fri, 17 Mar 2017 15:31:04 GMT

> We are using a query with union all and groupby and same table is read multiple times in the union all subquery.
> When run with Mapreduce, the job is run in one stage consuming n mappers and m reducers and all union all scans are done with the same job.

The logical plans are identical, by the way. MR effectively reads the same table again and again
as well, unless the correlation optimizer is folding those scans together.

I doubt that is happening, because of the unix_timestamp() call. An EXPLAIN of the query would be useful.

> Hence if there are 50 union alls in a query, the 50n map vertex tasks are launched which is huge.

Tez lets you scale the number of mappers up or down using the split grouping parameters, so you
can tune it down if you want to.

set tez.grouping.split-waves=0.1;

This would try to shrink the width of the map vertices, i.e. the number of mapper tasks launched for each scan.
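
For reference, split grouping is also bounded by a minimum and a maximum grouped split size. The sketch below shows the three settings together; the byte values are illustrative assumptions, not recommendations, and should be tuned against the actual table.

```sql
-- Fraction of available cluster capacity that Tez targets when grouping splits.
-- The Tez default is 1.7; a smaller value yields fewer, larger grouped splits.
set tez.grouping.split-waves=0.1;

-- Lower and upper bounds on the size of one grouped split, in bytes.
-- Assumed values for illustration: 256 MB minimum, 1 GB maximum.
set tez.grouping.min-size=268435456;
set tez.grouping.max-size=1073741824;
```

Raising the minimum size is another way to cut the task count, since each mapper then has to cover at least that much input.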

An alternative is to use a CTE with materialization (HIVE-11752), but that requires Hive 2.
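
A minimal sketch of the CTE approach, assuming a hypothetical table `events` with columns `key`, `kind` and `ds` (none of these names come from the original query). With the threshold set as below, a CTE referenced more than once should be materialized to a temporary table first, so the source table is scanned once instead of once per union branch.

```sql
-- Materialize a CTE once it is referenced more often than this threshold.
-- The default is -1, which disables materialization.
set hive.optimize.cte.materialize.threshold=1;

-- Hypothetical query: `events`, `key`, `kind` and `ds` are placeholder names.
WITH base AS (
  SELECT key, kind
  FROM events
  WHERE ds = '2017-03-17'
)
SELECT key, 'a' AS branch, count(*) AS cnt FROM base WHERE kind = 'a' GROUP BY key
UNION ALL
SELECT key, 'b' AS branch, count(*) AS cnt FROM base WHERE kind = 'b' GROUP BY key;
```

Running EXPLAIN on the rewritten query should show whether the CTE was actually materialized.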


You can probably get a ~2x speedup by replacing UNIX_TIMESTAMP() with CURRENT_TIMESTAMP.
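
As a sketch of that rewrite, assume a hypothetical table `events` whose `event_ts` column holds epoch seconds (both names are placeholders, and the real predicate in the query may look different). The no-argument unix_timestamp() is non-deterministic, while CURRENT_TIMESTAMP is fixed once for the whole query.

```sql
-- Before: no-argument unix_timestamp() is non-deterministic.
SELECT count(*)
FROM events
WHERE event_ts >= unix_timestamp() - 86400;

-- After: CURRENT_TIMESTAMP is a per-query constant, converted to epoch seconds.
SELECT count(*)
FROM events
WHERE event_ts >= unix_timestamp(current_timestamp) - 86400;
```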

