hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: How does tez calculate the number of Mappers/Reducers?
Date Sat, 25 Jun 2016 04:55:48 GMT

>Do you know how the number of splits is calculated?

To do that properly needs a whiteboard and a couple of hours - with the
primary complex variable being the YARN headroom calculation.

The simplest way to put it would be that it compute splits, tries to find
out the available capacity and tries to group the splits into waves across
the cluster capacity ranging from tez.grouping.min-size to
tez.grouping.max-size using the 3 locations that each split usually has.

The min-size is usually 16Mb and the max is usually 1Gb.

But everything starts off by calling InputFormat::getSplits() and then
collecting then into TezGroupedSplits - this makes it much nicer than
MapReduce's CombineInputFormat as non-file based splits can also be
merged/grouped.

Though that will predictably blow up when the splits show up as 0 sized.

>I also noticed a couple unusual things in our Splits(as seen below).
>Primarily getLength() always return 0l, which I¹m guessing is possibly
>causing other problems as well.

There are 3 problems common with custom split code usually

1) Size is unknown (0 or -1)
2) Locations are missing or is just "localhost" from the FileSystem
abstract class
3) If locations are present, most of those nodes don't run YARN

1 is a common problem with external data streams, 2 & 3 happen with
object-store filesystems today.

> Also our getSplits(..,..) function always returns an empty array of
>splits. Does that make sense?

The last part I hope isn't true. Your code looks like it is doing
splits.toArray(EMPTY_INPUT_SPLIT_ARRAY); which should reallocate a bigger
array to return.

>Also do you know if there is any developer documentation on custom
>StorageHandlers?

Not that I know of, but I'm more into Tez & LLAP than that code path.

Cheers,
Gopal



Mime
View raw message