Subject: OutOfMemoryError during reduce shuffle while reading Orc file
From: wzc
To: user@hive.apache.org
Date: Fri, 9 May 2014 14:12:23 +0800

Recently we have been converting some data warehouse tables from textfile to ORC format. Some of our Hive SQL queries that read these ORC tables failed at the reduce stage. The reducers failed while copying map outputs with the following exception:

Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
        at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
        at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
        at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:297)
        at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:287)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:411)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:341)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)


The affected SQL statements have the following form:

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=10000;
insert overwrite table xxx partition(`day`)
select
    count(distinct col1) c1,
    count(distinct col2) c2,
    ...
    count(distinct col11) col11
from t
group by col12, col13


Here we consider one specific job with 24G totalInputFileSize (ORC compressed). It launches 97 maps (mapred.max.split.size is 256M) and 30 reduces (hive.exec.reducers.bytes.per.reducer = 1G).

Since there are so many count(distinct) aggregations, the total reduce shuffle bytes grow to 59G (LZO compressed, around 550G decompressed). The average map output segment each reducer fetches from a single map is 550 * 1024 / 97 / 30 ≈ 193M. Here I notice two default parameters which control the memory usage of the shuffle process:

mapreduce.reduce.shuffle.input.buffer.percent = 0.9
mapreduce.reduce.shuffle.memory.limit.percent = 0.25

The memoryLimit and maxSingleShuffleLimit are computed as follows:
memoryLimit = total_memory * $mapreduce.reduce.shuffle.input.buffer.percent
maxSingleShuffleLimit = memoryLimit * $mapreduce.reduce.shuffle.memory.limit.percent
Here maxSingleShuffleLimit is the per-segment threshold for shuffling a map output into memory; larger segments go straight to disk.

From the log we can find all the runtime params:

2014-05-04 16:39:27,129 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: MergerManager: memoryLimit=1336252800, maxSingleShuffleLimit=334063200, mergeThreshold=881926912, ioSortFactor=10, memToMemMergeOutputsThreshold=10
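
As a sanity check, these logged values are consistent with the formulas above. Here is a throwaway sketch I used to back out the numbers (the class name and the derived heap figure are mine, not taken from the job config):

public class ShuffleLimitsCheck {
    public static void main(String[] args) {
        // Value copied from the MergeManagerImpl log line above.
        long memoryLimit = 1336252800L;

        // Defaults quoted earlier in this mail.
        double inputBufferPercent = 0.9;   // mapreduce.reduce.shuffle.input.buffer.percent
        double memoryLimitPercent = 0.25;  // mapreduce.reduce.shuffle.memory.limit.percent

        // memoryLimit = total_memory * inputBufferPercent, so the heap available
        // to the shuffle is roughly memoryLimit / 0.9.
        long impliedTotalMemory = (long) (memoryLimit / inputBufferPercent);  // ~1,484,725,333 bytes (~1.38 GiB)

        // maxSingleShuffleLimit = memoryLimit * memoryLimitPercent.
        long singleShuffleLimit = (long) (memoryLimit * memoryLimitPercent);  // 334,063,200, matching the log

        System.out.printf("implied total_memory  = %,d bytes%n", impliedTotalMemory);
        System.out.printf("maxSingleShuffleLimit = %,d bytes%n", singleShuffleLimit);
    }
}

Note that the ~193M average segment size computed above is well below maxSingleShuffleLimit (~318 MiB), so these segments all qualify for the in-memory shuffle path.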


If the memory already in use is close to memoryLimit and we shuffle one more map output into memory, the total memory used may exceed the reducer's heap (total_memory) under this configuration:

total_memory_used = memoryLimit + maxSingleShuffleLimit
                  = total_memory * input_buffer_percent * (1 + memory_limit_percent)
                  = total_memory * 0.9 * 1.25
                  = total_memory * 1.125
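
My reading of the reserve logic (a simplified paraphrase, not the actual MergeManagerImpl source; the type and field names below are mine) is that the in-memory check is done before the new segment's size is counted, which is exactly what allows the overshoot:

enum Placement { DISK, MEMORY, WAIT }

class ShuffleReserveSketch {
    long memoryLimit;            // total_memory * input.buffer.percent
    long maxSingleShuffleLimit;  // memoryLimit * memory.limit.percent
    long usedMemory;             // bytes currently held by in-memory map outputs

    Placement reserve(long requestedSize) {
        if (requestedSize > maxSingleShuffleLimit) {
            return Placement.DISK;     // oversized segments bypass memory entirely
        }
        if (usedMemory > memoryLimit) {
            return Placement.WAIT;     // fetcher stalls until the in-memory merge frees space
        }
        // usedMemory may already be just under memoryLimit, yet we still accept up to
        // maxSingleShuffleLimit more bytes here -- hence the 1.125x worst case above.
        usedMemory += requestedSize;
        return Placement.MEMORY;
    }
}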

When I set mapreduce.reduce.shuffle.input.buffer.percent to 0.6, the job runs well.
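
For reference, redoing the arithmetic with 0.6 (using the heap size implied by the logged memoryLimit, which is my back-calculation rather than a value from the job) shows why there is now headroom:

public class LoweredBufferCheck {
    public static void main(String[] args) {
        long totalMemory = 1484725333L;    // heap implied by logged memoryLimit / 0.9
        double inputBufferPercent = 0.6;   // lowered mapreduce.reduce.shuffle.input.buffer.percent
        double memoryLimitPercent = 0.25;  // unchanged default

        long memoryLimit = (long) (totalMemory * inputBufferPercent);            // ~890,835,199
        long maxSingleShuffleLimit = (long) (memoryLimit * memoryLimitPercent);  // ~222,708,799
        long worstCase = memoryLimit + maxSingleShuffleLimit;                    // ~1,113,543,998, i.e. 0.75 * heap

        System.out.printf("worst-case in-memory usage = %,d of %,d bytes%n", worstCase, totalMemory);
    }
}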


Here are my questions:
1. Are the default shuffle settings suitable, or am I missing something?
2. Although the job uses fewer maps and reduces after we compress the data with ORC, it runs slower than before. When I increase the number of reducers it takes less time. I wonder whether we could improve the estimateNumberOfReducers algorithm to take the input data format into consideration?

Any help is appreciated.