hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Long, Andrew" <>
Subject Re: How does tez calculate the number of Mappers/Reducers?
Date Sat, 25 Jun 2016 23:14:13 GMT
Thanks once again!

“But everything starts off by calling InputFormat::getSplits()”

Correct me if I’m wrong but at this point isn’t the number of splits calculated?

    public InputSplit[] getSplits(JobConf conf, int numSplits) throws IOException {

After this the splits are then group into chunks which are passed to mappers/reducers?


Do you know which version of hive this was introduced in?

“splits.toArray(EMPTY_INPUT_SPLIT_ARRAY); which should reallocate a bigger
array to return“

Facepalm.  Reading comprehension is clearly not my strength >.<

Cheers Andrew

On 6/24/16, 9:55 PM, "Gopal Vijayaraghavan" < on behalf of>

>Do you know how the number of splits is calculated?

To do that properly needs a whiteboard and a couple of hours - with the
primary complex variable being the YARN headroom calculation.

The simplest way to put it would be that it compute splits, tries to find
out the available capacity and tries to group the splits into waves across
the cluster capacity ranging from tez.grouping.min-size to
tez.grouping.max-size using the 3 locations that each split usually has.

The min-size is usually 16Mb and the max is usually 1Gb.

But everything starts off by calling InputFormat::getSplits() and then
collecting then into TezGroupedSplits - this makes it much nicer than
MapReduce's CombineInputFormat as non-file based splits can also be

Though that will predictably blow up when the splits show up as 0 sized.

>I also noticed a couple unusual things in our Splits(as seen below).
>Primarily getLength() always return 0l, which I¹m guessing is possibly
>causing other problems as well.

There are 3 problems common with custom split code usually

1) Size is unknown (0 or -1)
2) Locations are missing or is just "localhost" from the FileSystem
abstract class
3) If locations are present, most of those nodes don't run YARN

1 is a common problem with external data streams, 2 & 3 happen with
object-store filesystems today.

> Also our getSplits(..,..) function always returns an empty array of
>splits. Does that make sense?

The last part I hope isn't true. Your code looks like it is doing
splits.toArray(EMPTY_INPUT_SPLIT_ARRAY); which should reallocate a bigger
array to return.

>Also do you know if there is any developer documentation on custom

Not that I know of, but I'm more into Tez & LLAP than that code path.


View raw message