hadoop-mapreduce-user mailing list archives

From: Harsh J <ha...@cloudera.com>
Subject: Re: Mapper Record Spillage
Date: Sun, 11 Mar 2012 13:39:54 GMT
(Er, not sure how that ± got in there; I meant to type -100, lowered
further if it continued to show problems.)

On Sun, Mar 11, 2012 at 7:08 PM, Harsh J <harsh@cloudera.com> wrote:
> Hans,
>
> I don't think io.sort.mb can support a whole 2048 value (it builds one
> array of that size, and the JVM may not allow that). Can you lower
> it to 2000 ± 100 and try again?
>
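> For illustration only, a lowered setting along those lines might look
> like the following (1900 is just an example value below the 2048
> ceiling; the heap figure is the one already used in this thread):
>
>   // io.sort.mb is backed by a single byte[] of that size, so keep it
>   // comfortably below 2048 and well inside the task heap.
>   job.getConfiguration().setInt("io.sort.mb", 1900);
>   job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");
>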
> On Sun, Mar 11, 2012 at 1:36 PM, Hans Uhlig <huhlig@uhlisys.com> wrote:
>> If that is the case, then these two lines should provide more than
>> enough memory on a virtually unused cluster.
>>
>> job.getConfiguration().setInt("io.sort.mb", 2048);
>> job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");
>>
>> Such that a conversion from 1 GB of CSV text to binary primitives should fit
>> easily, but Java still throws a heap error even when there is 25 GB of
>> memory free.
>>
>> On Sat, Mar 10, 2012 at 11:50 PM, Harsh J <harsh@cloudera.com> wrote:
>>>
>>> Hans,
>>>
>>> You can change memory requirements for tasks of a single job, but not
>>> of a single task inside that job.
>>>
>>> This is briefly how the 0.20 framework (by default) works: the TT has
>>> notions only of "slots", and carries a maximum _number_ of
>>> simultaneous slots it may run. It does not know what each task,
>>> occupying one slot, will demand in resource terms. Your job then
>>> supplies a # of map tasks, and the amount of memory required per map
>>> task in general, as configuration. TTs then merely start the task JVMs
>>> with the provided heap configuration.
>>>
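>>> As a rough sketch of that split (values illustrative, property names
>>> as used elsewhere in this thread): the slot count lives on the TT side
>>> in mapred-site.xml (mapred.tasktracker.map.tasks.maximum), while the
>>> job only supplies the per-task heap, e.g.:
>>>
>>>   // Per-job, per-map-task heap; this applies uniformly to every map
>>>   // task of the job -- there is no per-task override.
>>>   job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");
>>>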
>>> On Sun, Mar 11, 2012 at 11:24 AM, Hans Uhlig <huhlig@uhlisys.com> wrote:
>>> > That was a typo in my email, not in the configuration. Is the memory
>>> > reserved for the tasks when the task tracker starts? You seem to be
>>> > suggesting that I need to set the memory to be the same for all map
>>> > tasks. Is there no way to override it for a single map task?
>>> >
>>> >
>>> > On Sat, Mar 10, 2012 at 8:41 PM, Harsh J <harsh@cloudera.com> wrote:
>>> >>
>>> >> Hans,
>>> >>
>>> >> It's possible you have a typo issue: mapred.map.child.jvm.opts -
>>> >> such a property does not exist. Perhaps you wanted
>>> >> "mapred.map.child.java.opts"?
>>> >>
>>> >> Additionally, the computation you need to do is: (# of map slots on a
>>> >> TT * per-map-task heap requirement) should be less than (Total RAM -
>>> >> 2/3 GB). With your 4 GB requirement, I guess you can support a max of
>>> >> 6-7 slots per machine (i.e. not counting reducer heap requirements in
>>> >> parallel).
>>> >>
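>>> >> As a worked example (assuming roughly 2-3 GB is left aside for the
>>> >> OS and daemons): (32 GB - 3 GB) / 4 GB per map task ≈ 7 map slots,
>>> >> before any reducer heap running in parallel is counted.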
>>> >> On Sun, Mar 11, 2012 at 9:30 AM, Hans Uhlig <huhlig@uhlisys.com> wrote:
>>> >> > I am attempting to speed up a mapping process whose input is
>>> >> > GZIP-compressed CSV files. The files range from 1-2 GB, and I am
>>> >> > running on a cluster where each node has a total of 32 GB of memory
>>> >> > available to use. I have attempted to tweak
>>> >> > mapred.map.child.jvm.opts with -Xmx4096mb and io.sort.mb to 2048 to
>>> >> > accommodate the size, but I keep getting Java heap errors or other
>>> >> > memory-related problems. My row count per mapper is below the
>>> >> > Integer.MAX_VALUE limit by several orders of magnitude, and the box
>>> >> > is NOT using anywhere close to its full memory allotment. How can I
>>> >> > specify that this map task can have 3-4 GB of memory for the
>>> >> > collection, partition and sort process without constantly spilling
>>> >> > records to disk?
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Harsh J
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>
>>
>
>
>
> --
> Harsh J



-- 
Harsh J
