hive-user mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: How does tez calculate the number of Mappers/Reducers?
Date Mon, 27 Jun 2016 20:01:02 GMT

>Correct me if I'm wrong but at this point isn't the number of splits
>calculated?

Yes you are correct, but the grouping kicks in after that.

The real reason for grouping is that shuffle operations are internally
MxN (every reducer fetches from every mapper) and explode out of control
if grouping hasn't been done.

Running through 5000 mappers x 200 reducers can accidentally run out of
ephemeral ports (i.e. roughly 64,000 ephemeral ports vs 5000 x 200 =
1,000,000 shuffle connections).
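
To make that arithmetic concrete, here is a rough sketch. The function name
and the constant are illustrative only, not anything from Tez, and the port
figure is just the approximate number quoted above.

```python
def shuffle_connections(mappers: int, reducers: int) -> int:
    """Worst-case shuffle fetch count: every reducer pulls from every mapper."""
    return mappers * reducers


# Approximate ephemeral port budget quoted in the message (an assumption,
# the real limit depends on the OS port range settings).
EPHEMERAL_PORTS = 64_000

connections = shuffle_connections(5000, 200)
print(connections, connections > EPHEMERAL_PORTS)  # 1000000 True
```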

Grouping the splits so that each mapper processes a minimum of 16 MB saves
a ton of those connections by reducing M.

And auto-reducer parallelism is there so that N gets cut down as well.
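
A simplified illustration of how a minimum group size shrinks M. This is
not Tez's actual grouping algorithm (which also weighs things like locality
and a maximum group size); the greedy packing, the function name and the
1 MB split sizes are all assumptions made for the example.

```python
MIN_GROUP_BYTES = 16 * 1024 * 1024  # 16 MB minimum per mapper, as described above


def group_splits(split_sizes, min_bytes=MIN_GROUP_BYTES):
    """Greedily pack consecutive splits until each group holds >= min_bytes."""
    groups, current, current_bytes = [], [], 0
    for size in split_sizes:
        current.append(size)
        current_bytes += size
        if current_bytes >= min_bytes:
            groups.append(current)
            current, current_bytes = [], 0
    if current:
        # Leftover splits below the minimum: fold them into the last group,
        # or keep them as the only group if nothing else exists.
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups


splits = [1024 * 1024] * 5000  # hypothetical input: 5000 splits of 1 MB each
mappers = len(group_splits(splits))
reducers = 200  # N is left alone here; auto-reducer parallelism would shrink it
print(mappers, mappers * reducers)  # 312 62400
```

With these made-up sizes, M drops from 5000 to 312, and the MxN fetch count
falls from 1,000,000 to 62,400 without touching N at all.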

>"ColumnarSplitSizeEstimator.java"
>
>Do you know which version of hive this was introduced in?

It's in hive-1.3.x and hive-2.0. It hasn't been backported to 1.2 because
it depends on a tez-0.7 feature and 1.2 is tied to tez-0.5.4.

Pretty sure EMR runs a more modern Tez, so that backport isn't hard to do
from HIVE-7428.

Cheers,
Gopal


