hadoop-pig-dev mailing list archives

From "Olga Natkovich" <ol...@yahoo-inc.com>
Subject RE: Proposal for handling "GC overhead limit" errors
Date Wed, 11 Jun 2008 19:20:34 GMT
+1 

> -----Original Message-----
> From: Pradeep Kamath [mailto:pradeepk@yahoo-inc.com] 
> Sent: Wednesday, June 11, 2008 12:18 PM
> To: pig-dev@incubator.apache.org; pi.songs@gmail.com
> Subject: RE: Proposal for handling "GC overhead limit" errors
> 
> The "GC overhead limit" error can occur even when we are not
> low on memory: if memory is fragmented, the GC can spend too
> much time freeing very little of it. We also don't want to
> hurt performance by invoking the GC too often. Keeping these
> two points in mind, I propose that gcActivationSize be applied
> to the memory freed thus far rather than to the current
> Spillable's memory size, that a flag be set once this size is
> reached, and that GC be invoked only once per handler invocation.
> 
> Also, I would like to use the following defaults if they are reasonable:
>     // if we freed at least this much, invoke GC
>     // (default 40 MB - this can be overridden by a user-supplied property)
>     private static long gcActivationSize = 40000000L ;
>     
>     // spill file size should be at least this much
>     // (default 5 MB - this can be overridden by a user-supplied property)
>     private static long spillFileSizeThreshold = 5000000L ;
>     
>     // fraction of biggest heap for which we want to get
>     // "memory usage threshold exceeded" notifications
>     private static double memoryThresholdFraction = 0.7;
>     
>     // fraction of biggest heap for which we want to get
>     // "collection threshold exceeded" notifications
>     private static double collectionMemoryThresholdFraction = 0.5;
> 
> 
> I am currently running more tests to check if previously seen 
> issues with queries are now solved with these changes.
> 
> -Pradeep
> 
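A minimal sketch of how the two threshold fractions above could be registered through the standard `java.lang.management` API. The class name `ThresholdSketch` and its helper methods are hypothetical and are not taken from SpillableMemoryManager; only the two fractions come from the message above. Picking the "biggest heap" as the heap pool with the largest defined maximum is an assumption.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

public class ThresholdSketch {
    // Defaults quoted in the message above.
    static final double MEMORY_THRESHOLD_FRACTION = 0.7;
    static final double COLLECTION_THRESHOLD_FRACTION = 0.5;

    /** Assumption: "biggest heap" means the heap pool with the largest defined max. */
    static MemoryPoolMXBean biggestHeapPool() {
        MemoryPoolMXBean biggest = null;
        long biggestMax = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage == null || pool.getType() != MemoryType.HEAP) {
                continue;
            }
            long max = usage.getMax(); // -1 when the maximum is undefined
            if (pool.isUsageThresholdSupported() && max > biggestMax) {
                biggest = pool;
                biggestMax = max;
            }
        }
        return biggest;
    }

    /** Sets both thresholds and registers the listener; returns false if no pool qualifies. */
    static boolean install(NotificationListener listener) {
        MemoryPoolMXBean pool = biggestHeapPool();
        if (pool == null) {
            return false;
        }
        long max = pool.getUsage().getMax();
        // "memory usage threshold exceeded": fires whenever usage crosses the mark
        pool.setUsageThreshold((long) (max * MEMORY_THRESHOLD_FRACTION));
        // "collection threshold exceeded": checked only after a GC of this pool
        if (pool.isCollectionUsageThresholdSupported()) {
            pool.setCollectionUsageThreshold((long) (max * COLLECTION_THRESHOLD_FRACTION));
        }
        NotificationEmitter emitter = (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener(listener, null, null);
        return true;
    }

    /** True for a usage-threshold notification, false for any other type. */
    static boolean isUsageNotification(String type) {
        return MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(type);
    }
}
```

A listener would typically call `isUsageNotification(notification.getType())` to decide which threshold applies to the event it is handling.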
> -----Original Message-----
> From: pi song [mailto:pi.songs@gmail.com]
> Sent: Wednesday, June 11, 2008 7:15 AM
> To: pig-dev@incubator.apache.org
> Subject: Re: Proposal for handling "GC overhead limit" errors
> 
> Sorry. It's actually Long.MAX_VALUE, not Integer.
> 
> On Thu, Jun 12, 2008 at 12:12 AM, pi song <pi.songs@gmail.com> wrote:
> 
> > Pradeep,
> >
> > I totally buy your biggest_heap*0.7 idea.
> >
> > BUT!!, I've tried this:-
> >
> >         for (int i = 0; i < 100000; i++) {
> >             StringBuilder sb = new StringBuilder();
> >             for (int j = 0; j < 100; j++) {
> >                 sb.append("hodgdfdsfsddf");
> >             }
> >             System.gc();
> >         }
> > And it doesn't give me any error. So I think calling it too often is not a
> > problem, except that it might be slow.
> >
> > gcActivationSize by default is set to Integer.MAX_VALUE. I believe most
> > people have never used it. So it should have nothing to do with the current
> > problem.
> >
> > My concern about using soft/weak references for data in a bag is that if the
> > granularity is too fine, we will need more space for those additional
> > pointers.
> >
> > Pi
> >
> >
> > On Wed, Jun 11, 2008 at 5:51 AM, Mridul Muralidharan <mridulm@yahoo-inc.com> wrote:
> >
> >>
> >>
> >> Ideally, instead of using SpillableMemoryManager, it might be better to:
> >>
> >> a) use soft/weak references to refer to the data in a bag/tuple.
> >> a.1) prefer soft references, since they are less GC-sensitive than weak
> >> references (a GC typically kicks all weak refs out). Soft refs are sort
> >> of like a cache whose entries are not kicked out so frequently.
> >> b) register them with a reference queue and manage the life cycle of the
> >> referent (to spill/not spill).
> >> c) override get/put in bag/tuple such that we load off the disk if the
> >> referent is null (this should already be done in some way in the code
> >> currently).
> >>
> >>
> >> Of course, this is much more work and is slightly more tricky ... so if
> >> SpillableMemoryManager can handle the requirements, it should work fine.
> >>
> >>
> >> Regards,
> >> Mridul
> >>
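A rough sketch of the soft-reference idea in a) to c) above, under one loud assumption: the data must already be recoverable from disk before the collector clears the reference, because a cleared soft reference cannot be spilled after the fact. `SoftChunk`, its `loader`, and `clearForDemo()` are hypothetical names for illustration and are not Pig classes.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.function.Supplier;

/** Holds a value softly and reloads it on demand once the GC has cleared it. */
public class SoftChunk<T> {
    private final ReferenceQueue<T> queue; // cleared references are enqueued here by the GC
    private final Supplier<T> loader;      // hypothetical: reads the value back from disk
    private SoftReference<T> ref;

    public SoftChunk(T value, Supplier<T> loader, ReferenceQueue<T> queue) {
        this.queue = queue;
        this.loader = loader;
        this.ref = new SoftReference<>(value, queue);
    }

    /** Step c): if the referent is null, load it off the disk and re-wrap it. */
    public T get() {
        T value = ref.get();
        if (value == null) {
            value = loader.get();
            ref = new SoftReference<>(value, queue);
        }
        return value;
    }

    /** Simulates the collector clearing the reference, for demonstration only. */
    void clearForDemo() {
        ref.clear();
    }
}
```

The extra `SoftReference` object per chunk is the space cost that a fine granularity would multiply.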
> >>
> >>
> >> Pradeep Kamath wrote:
> >>
> >>> Hi,
> >>>
> >>>
> >>> Currently in org.apache.pig.impl.util.SpillableMemoryManager:
> >>>
> >>>
> >>> 1) We use the memory management interface to get notified when usage
> >>> exceeds the "collection threshold" (we set this to
> >>> biggest_heap*0.5). With this in place we are still seeing "GC overhead
> >>> limit" issues when trying large dataset operations. Observing some runs,
> >>> it looks like the notification is not frequent enough and early enough
> >>> to prevent memory issues, possibly because this notification only occurs
> >>> after GC.
> >>>
> >>>
> >>> 2) We only attempt to free up to:
> >>>
> >>> long toFree = info.getUsage().getUsed() - (long)(info.getUsage().getMax()*.5);
> >>>
> >>> This is only the excess over the threshold that caused the
> >>> notification, and it is not enough to keep the handler from being
> >>> called again soon.
> >>>
> >>>
> >>> 3) While iterating over spillables, if the current spillable's memory size
> >>> is > gcActivationSize, we try to invoke System.gc()
> >>>
> >>>
> >>> 4) We *always* invoke System.gc() after iterating over spillables
> >>>
> >>>
> >>> Proposed changes are:
> >>>
> >>> =================
> >>>
> >>> 1) In addition to the "collection threshold" of biggest_heap*0.5, a "usage
> >>> threshold" of biggest_heap*0.7 will be used so we get notified early and
> >>> often, irrespective of whether garbage collection has occurred.
> >>>
> >>>
> >>> 2) We will attempt to free
> >>> toFree = info.getUsage().getUsed() - threshold + (long)(threshold * 0.5);
> >>> where threshold is (info.getUsage().getMax() * 0.7) if the
> >>> handleNotification() method is handling a "usage threshold exceeded"
> >>> notification and (info.getUsage().getMax() * 0.5) otherwise ("collection
> >>> threshold exceeded" case)
> >>>
> >>>
> >>> 3) While iterating over spillables, if the *memory freed thus far* is >
> >>> gcActivationSize OR if we have freed sufficient memory (based on 2)
> >>> above), then we set a flag to invoke System.gc() when we exit the loop.
> >>>
> >>> 4) We will invoke System.gc() only if the flag is set in 3) above
> >>>
> >>>
> >>> Please provide thoughts/comments.
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> Pradeep
> >>>
> >>>
> >>>
> >>
> >
> 
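For reference, proposed changes 2) to 4) can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual SpillableMemoryManager code: the nested `Spillable` interface, the `fixed()` helper, skipping bags smaller than the spill file threshold, and stopping the loop once enough has been freed are all hypothetical choices. Only the `toFree` formula, the two constants, and the "set a flag, GC once" rule come from the thread.

```java
import java.util.List;

public class SpillSketch {
    /** Hypothetical stand-in for a spillable bag. */
    interface Spillable {
        long getMemorySize();
        long spill(); // writes to disk, returns the number of bytes freed
    }

    static final long GC_ACTIVATION_SIZE = 40_000_000L;       // 40 MB default
    static final long SPILL_FILE_SIZE_THRESHOLD = 5_000_000L; // 5 MB default

    /** toFree = used - threshold + threshold * 0.5, i.e. bring usage down to half the threshold. */
    static long bytesToFree(long used, long max, boolean usageNotification) {
        long threshold = (long) (max * (usageNotification ? 0.7 : 0.5));
        return used - threshold + (long) (threshold * 0.5);
    }

    /** Spills until toFree bytes are released; returns the flag saying whether to GC once. */
    static boolean spillUntil(List<Spillable> spillables, long toFree) {
        long freed = 0;
        boolean invokeGc = false;
        for (Spillable s : spillables) {
            if (s.getMemorySize() < SPILL_FILE_SIZE_THRESHOLD) {
                continue; // assumption: too small to be worth a spill file
            }
            freed += s.spill();
            if (freed > GC_ACTIVATION_SIZE) {
                invokeGc = true; // memory freed thus far exceeds the activation size
            }
            if (freed >= toFree) {
                invokeGc = true; // freed enough for this notification
                break;
            }
        }
        return invokeGc; // the caller invokes System.gc() once, only if this is true
    }

    /** Demo helper: an in-memory bag of a fixed size that frees everything on spill. */
    static Spillable fixed(final long size) {
        return new Spillable() {
            private long remaining = size;
            public long getMemorySize() { return remaining; }
            public long spill() { long f = remaining; remaining = 0; return f; }
        };
    }
}
```

As a worked case, with a 1000 MB pool, 600 MB used, and a "collection threshold exceeded" notification, the threshold is 500 MB and `toFree` is 600 - 500 + 250 = 350 MB, compared with 100 MB under the current formula.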
