hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Mapper Record Spillage
Date Sun, 11 Mar 2012 13:38:48 GMT
Hans,

I don't think io.sort.mb can support a value as large as 2048 (it
backs the sort buffer with a single array of that size, and the JVM
may not be allowing that allocation). Can you lower it to 2000 ± 100
and try again?
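
For reference, a rough sketch of the per-job setting (this assumes the
0.20-era MapOutputBuffer, which backs the sort buffer with a single
byte array of io.sort.mb * 1024 * 1024 bytes):

  // Illustrative value; it must stay below 2048 so the backing
  // array size fits within the framework's limit.
  job.getConfiguration().setInt("io.sort.mb", 2000);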

On Sun, Mar 11, 2012 at 1:36 PM, Hans Uhlig <huhlig@uhlisys.com> wrote:
> If that is the case, then these two lines should allocate more than
> enough memory on a virtually unused cluster:
>
> job.getConfiguration().setInt("io.sort.mb", 2048);
> job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");
>
> With that, a conversion from 1 GB of CSV text to binary primitives
> should fit easily, but Java still throws a heap error even when there
> is 25 GB of memory free.
>
> On Sat, Mar 10, 2012 at 11:50 PM, Harsh J <harsh@cloudera.com> wrote:
>>
>> Hans,
>>
>> You can change memory requirements for tasks of a single job, but not
>> of a single task inside that job.
>>
>> This is briefly how the 0.20 framework works (by default): the TT
>> has notions only of "slots", and carries a maximum _number_ of
>> simultaneous slots it may run. It does not know what each task
>> occupying a slot will demand in resource terms. Your job then
>> supplies a number of map tasks, and the amount of memory required
>> per map task in general, as configuration. TTs then merely start
>> the task JVMs with the provided heap configuration.
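>>
>> As a rough sketch (illustrative values, not recommendations,
>> assuming a 0.20-style setup):
>>
>>   // Per-job: every map task of this job gets the same heap.
>>   job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");
>>
>>   // Per-TT, in mapred-site.xml (TT restart required) - this is
>>   // what fixes the slot count, independent of any job:
>>   // <property>
>>   //   <name>mapred.tasktracker.map.tasks.maximum</name>
>>   //   <value>7</value>
>>   // </property>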
>>
>> On Sun, Mar 11, 2012 at 11:24 AM, Hans Uhlig <huhlig@uhlisys.com> wrote:
>> > That was a typo in my email, not in the configuration. Is the
>> > memory reserved for the tasks when the task tracker starts? You
>> > seem to be suggesting that I need to set the memory to be the same
>> > for all map tasks. Is there no way to override it for a single map
>> > task?
>> >
>> >
>> > On Sat, Mar 10, 2012 at 8:41 PM, Harsh J <harsh@cloudera.com> wrote:
>> >>
>> >> Hans,
>> >>
>> >> It's possible you have a typo issue: mapred.map.child.jvm.opts -
>> >> such a property does not exist. Perhaps you wanted
>> >> "mapred.map.child.java.opts"?
>> >>
>> >> Additionally, the computation you need to do is: (# of map slots
>> >> on a TT * per-map-task heap requirement) should be less than
>> >> (total RAM - 2 to 3 GB). With your 4 GB requirement, I guess you
>> >> can support a max of 6-7 slots per machine (i.e. not counting
>> >> reducer heap requirements in parallel).
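>> >>
>> >> For example, with your 32 GB nodes (illustrative arithmetic):
>> >> 32 GB - ~2.5 GB (OS + DataNode + TT daemons) ≈ 29.5 GB usable,
>> >> and 29.5 GB / 4 GB per map task ≈ 7 map slots.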
>> >>
>> >> On Sun, Mar 11, 2012 at 9:30 AM, Hans Uhlig <huhlig@uhlisys.com> wrote:
>> >> > I am attempting to speed up a mapping process whose input is
>> >> > GZIP-compressed CSV files. The files range from 1-2 GB, and I am
>> >> > running on a cluster where each node has a total of 32 GB of
>> >> > memory available. I have attempted to tweak
>> >> > mapred.map.child.jvm.opts with -Xmx4096mb and io.sort.mb to 2048
>> >> > to accommodate the size, but I keep getting Java heap errors or
>> >> > other memory-related problems. My row count per mapper is well
>> >> > below the Integer.MAX_VALUE limit by several orders of
>> >> > magnitude, and the box is NOT using anywhere close to its full
>> >> > memory allotment. How can I specify that this map task can have
>> >> > 3-4 GB of memory for the collect, partition, and sort process
>> >> > without constantly spilling records to disk?
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J
