I have a Pig script which runs on YARN. Each map task created by this Pig script takes 128 MB as input and no more than that.
I want to increase the input size of each map task. I've read that the input split size is determined using the following formula:
max(min split size, min(block size, max split size)).
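If it helps, I believe this formula corresponds to computeSplitSize in Hadoop's FileInputFormat; paraphrasing the method from the Hadoop 2.x source as I understand it:

```java
// org.apache.hadoop.mapreduce.lib.input.FileInputFormat (paraphrased):
// the block size is used as the split size, clamped between the
// configured minimum and maximum split sizes.
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
```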
These are the values I'm setting for these parameters:
dfs.blocksize = 134217728
mapreduce.input.fileinputformat.split.maxsize = 1610612736
mapreduce.input.fileinputformat.split.minsize.per.node = 222298112
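For reference, here is a minimal sketch of how such properties can be set from inside a Pig script (this assumes SET statements; the names and values are exactly the ones above, and dfs.blocksize would normally be a cluster-side HDFS setting):

```pig
-- Sketch: passing the split-related properties to the MapReduce jobs
-- launched by this script. Note that dfs.blocksize only affects files
-- written after it changes; it is included here for completeness.
SET dfs.blocksize '134217728';
SET mapreduce.input.fileinputformat.split.maxsize '1610612736';
SET mapreduce.input.fileinputformat.split.minsize.per.node '222298112';
```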
According to the values configured, the input split size should be 805306368, but it is still 134217728, which is the same as dfs.blocksize.
Moreover, every time I change dfs.blocksize to a higher value, the input to each map task increases by the same amount.
My setup:
Cloudera: 5.5.1