Hi All,

I have a Pig script that runs on YARN. Each map task created by this script takes 128 MB of input and never more than that.

I want to increase the input size of each map task. I've read that the input split size is determined by the following formula:

max(min split size, min(block size, max split size)).

These are the values I've set for the relevant parameters:

dfs.blocksize = 134217728
mapreduce.input.fileinputformat.split.maxsize = 1610612736
mapreduce.input.fileinputformat.split.minsize = 805306368
mapreduce.input.fileinputformat.split.minsize.per.node = 222298112
mapreduce.input.fileinputformat.split.minsize.per.rack = 222298112

According to the configured values, the input size should be 805306368 bytes (768 MB), but it is still 134217728 bytes (128 MB), which is the same as dfs.blocksize.
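
To make that expectation concrete, here is a minimal Python sketch that just evaluates the formula quoted above with my values. The function name expected_split_size is made up for illustration and is not a Hadoop or Pig API. It only reproduces the arithmetic, not whatever Pig actually does when it builds its splits.

```python
MIB = 1024 * 1024


def expected_split_size(min_split: int, block_size: int, max_split: int) -> int:
    """Evaluate max(min split size, min(block size, max split size))."""
    return max(min_split, min(block_size, max_split))


# Values copied from my configuration (all in bytes).
settings = {
    "dfs.blocksize": 134217728,
    "mapreduce.input.fileinputformat.split.maxsize": 1610612736,
    "mapreduce.input.fileinputformat.split.minsize": 805306368,
}

split = expected_split_size(
    min_split=settings["mapreduce.input.fileinputformat.split.minsize"],
    block_size=settings["dfs.blocksize"],
    max_split=settings["mapreduce.input.fileinputformat.split.maxsize"],
)

print(f"expected split: {split} bytes ({split // MIB} MiB)")
print(f"block size:     {settings['dfs.blocksize']} bytes "
      f"({settings['dfs.blocksize'] // MIB} MiB)")
```

If the formula applied as written, the minimum split size would win because it is larger than the block size, so each map task should receive 768 MB rather than 128 MB.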

However, every time I change dfs.blocksize to a higher value, the input to each map task increases by the same amount.

My setup is as follows:
Cloudera: 5.5.1
Hadoop: 2.6.0
Pig: 0.12.0