hadoop-pig-dev mailing list archives

From "pi song" <pi.so...@gmail.com>
Subject Re: Proposal for handling "GC overhead limit" errors
Date Wed, 11 Jun 2008 14:12:05 GMT
Pradeep,

I totally buy your biggest_heap*0.7 idea.

But I've tried this:

        for (int i = 0; i < 100000; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < 100; j++) {
                sb.append("hodgdfdsfsddf");
            }
            System.gc();
        }
And it doesn't give me any error. So I think calling System.gc() too often
is not a problem in itself, except that it might be slow.

gcActivationSize defaults to Integer.MAX_VALUE, and I believe most people
have never changed it. So it should have nothing to do with the current
problem.

My concern about using soft/weak references for the data in a bag is that
if the granularity is too fine, we will need extra space for all those
additional reference objects.

Pi

On Wed, Jun 11, 2008 at 5:51 AM, Mridul Muralidharan <mridulm@yahoo-inc.com>
wrote:

>
>
> Ideally, instead of using SpillableMemoryManager, it might be better to -
>
> a) use a soft/weak reference to refer to the data in a bag/tuple.
> a.1) prefer a soft reference, since it is less GC-sensitive than a weak
> reference (a gc typically kicks all weak refs out). Soft refs are sort
> of like a cache and are not kicked out as frequently.
> b) register them with a reference queue and manage the life cycle of the
> referent (to spill/not spill).
> c) override get/put in bag/tuple such that we load off the disk if the
> referent is null (this should already be done in some way in the code
> currently).
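The soft-reference scheme sketched above might look roughly like this. This is a minimal illustration, not Pig's actual code: the SpillableBag class, its get() method, and readFromDisk() are all hypothetical names invented for the example.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the bag holds only a SoftReference to its in-memory
// contents. When the GC clears the reference under memory pressure, the
// cleared reference appears on the ReferenceQueue (where a manager could
// track it), and a later get() reloads the data from the spill file.
class SpillableBag {
    private final ReferenceQueue<List<String>> queue = new ReferenceQueue<>();
    private SoftReference<List<String>> contents;

    SpillableBag(List<String> data) {
        // b) register with a reference queue so the life cycle can be managed
        contents = new SoftReference<>(data, queue);
    }

    // c) override get such that we load off disk if the referent is null
    List<String> get() {
        List<String> data = contents.get();
        if (data == null) {                          // GC cleared the soft ref
            data = readFromDisk();                   // reload from spill file
            contents = new SoftReference<>(data, queue);
        }
        return data;
    }

    private List<String> readFromDisk() {
        // placeholder: a real implementation would deserialize the spill file
        return new ArrayList<>();
    }
}
```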
>
>
> Of course, this is much more work and slightly more tricky ... so if
> SpillableMemoryManager can handle the requirements, it should work fine.
>
>
> Regards,
> Mridul
>
>
>
> Pradeep Kamath wrote:
>
>> Hi,
>>
>>
>> Currently in org.apache.pig.impl.util.SpillableMemoryManager:
>>
>>
>> 1) We use the java.lang.management memory notification mechanism to get
>> notified when the "collection threshold" exceeds a limit (we set this to
>> biggest_heap*0.5). With this in place we are still seeing "GC overhead
>> limit" issues when trying large dataset operations. From observing some
>> runs, it looks like the notification is neither frequent enough nor early
>> enough to prevent memory issues, possibly because this notification only
>> occurs after a GC.
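The collection-threshold hook being described can be sketched with the standard java.lang.management API. This is a minimal illustration, not Pig's actual code; the class name and the 0.5 fraction follow the text above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class CollectionThresholdSketch {
    // Set the collection usage threshold (checked only after a GC) to
    // fraction * max on each heap pool that supports it.
    // Returns how many pools were configured.
    static int configure(double fraction) {
        int configured = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP
                    && pool.isCollectionUsageThresholdSupported()) {
                long max = pool.getUsage().getMax();
                if (max > 0) {                       // max can be undefined (-1)
                    pool.setCollectionUsageThreshold((long) (max * fraction));
                    configured++;
                }
            }
        }
        return configured;
    }

    public static void main(String[] args) {
        System.out.println(configure(0.5));
    }
}
```

Because these notifications are delivered only after a collection has run, they can arrive too late, which matches the behavior described above.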
>>
>>
>> 2) We only attempt to free upto :
>>
>> long toFree = info.getUsage().getUsed() -
>> (long)(info.getUsage().getMax()*.5);
>>
>> This is only the excess over the threshold that triggered the
>> notification, and freeing just that is not enough to avoid being
>> notified again very soon.
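To see why that amount runs out quickly, here is the arithmetic with hypothetical numbers; the 1000 MB heap and 600 MB usage are assumptions for illustration, not figures from this thread.

```java
public class CurrentToFreeExample {
    public static void main(String[] args) {
        long max = 1000;   // MB, hypothetical max heap
        long used = 600;   // MB, hypothetical usage when the notification fires
        // current formula: free only the excess over the 0.5 threshold
        long toFree = used - (long) (max * 0.5);   // 600 - 500 = 100 MB
        System.out.println(toFree);                // prints 100
    }
}
```

Freeing only 100 MB leaves usage sitting right at the 500 MB threshold, so the very next allocation burst triggers another notification.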
>>
>>
>> 3) While iterating over spillables, if current spillable's memory size
>> is > gcActivationSize, we try to invoke System.gc
>>
>>
>> 4) We *always* invoke System.gc() after iterating over spillables
>>
>>
>> Proposed changes are:
>>
>> =================
>>
>> 1) In addition to the "collection threshold" of biggest_heap*0.5, a "usage
>> threshold" of biggest_heap*0.7 will be used so we get notified early and
>> often, irrespective of whether garbage collection has occurred.
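The usage threshold can be set through the same java.lang.management API; a minimal sketch, assuming the standard MemoryPoolMXBean calls (the class name is invented, and the 0.7 fraction is the one proposed above):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class UsageThresholdSketch {
    // Set a plain usage threshold (checked at allocation time, independent
    // of GC) on every heap pool that supports it; returns how many were set.
    static int configure(double fraction) {
        int configured = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP
                    && pool.isUsageThresholdSupported()) {
                long max = pool.getUsage().getMax();
                if (max > 0) {                       // max can be undefined (-1)
                    pool.setUsageThreshold((long) (max * fraction));
                    configured++;
                }
            }
        }
        return configured;
    }

    public static void main(String[] args) {
        System.out.println(configure(0.7));
    }
}
```

Unlike the collection threshold, this one is crossed as memory is allocated, so the notification does not have to wait for a GC.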
>>
>>
>> 2) We will attempt to free
>> toFree = info.getUsage().getUsed() - threshold + (long)(threshold * 0.5);
>> where threshold is (info.getUsage().getMax() * 0.7) if the
>> handleNotification() method is handling a "usage threshold exceeded"
>> notification, and (info.getUsage().getMax() * 0.5) otherwise (the
>> "collection threshold exceeded" case).
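Some back-of-the-envelope arithmetic shows what the proposed formula frees in the usage-threshold case; the 1000 MB heap and 750 MB usage are assumptions for illustration.

```java
public class ProposedToFreeExample {
    public static void main(String[] args) {
        long max = 1000;   // MB, hypothetical max heap
        long used = 750;   // MB, hypothetical usage when the notification fires
        long threshold = (long) (max * 0.7);                        // 700 MB
        // proposed formula: excess over the threshold plus half the threshold
        long toFree = used - threshold + (long) (threshold * 0.5);  // 50 + 350
        System.out.println(toFree);                                 // prints 400
    }
}
```

Freeing 400 MB drops usage well below the 700 MB threshold, so notifications should arrive far less often than with the current formula.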
>>
>>
>> 3) While iterating over spillables, if the *memory freed thus far* is >
>> gcActivationSize OR if we have freed sufficient memory (based on 2)
>> above), then we set a flag to invoke System.gc when we exit the loop.
>>
>> 4) We will invoke System.gc() only if the flag is set in 3) above
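Steps 3) and 4) can be sketched as follows. The Spillable interface, its spill() signature, and the surrounding class are illustrative assumptions, not Pig's actual code; only the names spillables, toFree, and gcActivationSize come from the text above.

```java
// Hypothetical interface: spill to disk and report how many bytes were freed.
interface Spillable {
    long spill();
}

class SpillLoop {
    // Iterate over spillables, accumulate freed bytes, set a flag once the
    // running total passes gcActivationSize or we have freed enough (toFree),
    // and call System.gc() at most once, after the loop exits.
    static long run(Iterable<Spillable> spillables, long toFree,
                    long gcActivationSize) {
        long freedSoFar = 0;
        boolean invokeGc = false;
        for (Spillable s : spillables) {
            freedSoFar += s.spill();
            if (freedSoFar > gcActivationSize || freedSoFar >= toFree) {
                invokeGc = true;   // defer the actual call to after the loop
            }
            if (freedSoFar >= toFree) {
                break;             // freed enough; stop spilling
            }
        }
        if (invokeGc) {
            System.gc();           // invoked at most once per notification
        }
        return freedSoFar;
    }
}
```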
>>
>>
>> Please provide thoughts/comments.
>>
>>
>> Thanks,
>>
>> Pradeep
>>
>>
>>
>
