hadoop-mapreduce-issues mailing list archives

From "kiran sreekumar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2308) Sort buffer size (io.sort.mb) is limited to < 2 GB
Date Tue, 05 Feb 2013 04:48:16 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571040#comment-13571040 ]

kiran sreekumar commented on MAPREDUCE-2308:
--------------------------------------------

io.sort.mb should be 10 * io.sort.factor.
HADOOP-3473 suggests keeping the default at 100.
                
> Sort buffer size (io.sort.mb) is limited to < 2 GB
> --------------------------------------------------
>
>                 Key: MAPREDUCE-2308
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2308
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0
>         Environment: Cloudera CDH3b3 (0.20.2+)
>            Reporter: Jay Hacker
>            Priority: Minor
>
> I have MapReduce jobs that use a large amount of per-task memory, because the algorithm
I'm using converges faster if more data is together on a node.  I have my JVM heap size set
at 3200 MB, and if I use the popular rule of thumb that io.sort.mb should be ~70% of that,
I get 2240 MB.  I rounded this down to 2048 MB, but map tasks crash with:
> {noformat}
> java.io.IOException: Invalid "io.sort.mb": 2048
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:790)
>         ...
> {noformat}
> MapTask.MapOutputBuffer implements its buffer with a byte[] of size io.sort.mb (in bytes),
and sanity-checks the size before allocating the array.  The problem is that Java arrays
can't have more than 2^31 - 1 elements (even with a 64-bit JVM), and this is a limitation
of the Java language specification itself.  As memory and data sizes grow, this would seem
to be a crippling limitation of Java.
> It would be nice if this ceiling were documented, and an error issued sooner, e.g. in
jobtracker startup upon reading the config.  Going forward, we may need to implement some
array of arrays hack for large buffers. :(
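
For illustration only, here is a minimal Java sketch of the two points in the description above: why 2048 MB cannot fit in a single byte[], and what an "array of arrays" buffer might look like. The class ChunkedByteBuffer, its methods, and the chunkShift parameter are hypothetical names invented for this sketch; none of this is taken from MapTask or any Hadoop patch.

```java
/**
 * Hypothetical sketch (not Hadoop code): a byte buffer addressed by a long
 * index and backed by several byte[] chunks, so that its total capacity can
 * exceed the 2^31 - 1 element limit of a single Java array.
 */
final class ChunkedByteBuffer {
    /** Largest whole number of MB whose byte count still fits in an int. */
    public static final int MAX_SINGLE_ARRAY_MB = Integer.MAX_VALUE >> 20;

    private final byte[][] chunks;
    private final long capacity;
    private final int chunkShift;  // each full chunk holds 2^chunkShift bytes
    private final long chunkMask;

    /** True if a buffer of {@code mb} megabytes fits in one byte[]. */
    public static boolean fitsInSingleArray(long mb) {
        return mb > 0 && (mb << 20) <= Integer.MAX_VALUE;
    }

    public ChunkedByteBuffer(long capacityBytes, int chunkShift) {
        if (capacityBytes < 0 || chunkShift < 1 || chunkShift > 30) {
            throw new IllegalArgumentException("bad capacity or chunk size");
        }
        this.capacity = capacityBytes;
        this.chunkShift = chunkShift;
        this.chunkMask = (1L << chunkShift) - 1;
        long chunkSize = 1L << chunkShift;
        // Round up so a partial final chunk is still allocated.
        int n = Math.toIntExact((capacityBytes + chunkSize - 1) >>> chunkShift);
        chunks = new byte[n][];
        long remaining = capacityBytes;
        for (int i = 0; i < n; i++) {
            chunks[i] = new byte[(int) Math.min(chunkSize, remaining)];
            remaining -= chunks[i].length;
        }
    }

    public long capacity() {
        return capacity;
    }

    public byte get(long index) {
        checkIndex(index);
        return chunks[(int) (index >>> chunkShift)][(int) (index & chunkMask)];
    }

    public void put(long index, byte value) {
        checkIndex(index);
        chunks[(int) (index >>> chunkShift)][(int) (index & chunkMask)] = value;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= capacity) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }

    public static void main(String[] args) {
        System.out.println("2047 MB fits: " + fitsInSingleArray(2047));
        System.out.println("2048 MB fits: " + fitsInSingleArray(2048));
        // Tiny chunks (2^4 = 16 bytes) keep the demo cheap to run.
        ChunkedByteBuffer buf = new ChunkedByteBuffer(100, 4);
        buf.put(99, (byte) 42);
        System.out.println("buf[99] = " + buf.get(99));
    }
}
```

With a chunkShift of 30 (1 GiB chunks), a 3 GB buffer would be held as three arrays. Note that a real sort buffer would also need its record offsets and index arithmetic widened from int to long, which this sketch does not attempt.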

