hadoop-pig-dev mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce
Date Mon, 26 Jul 2010 19:26:16 GMT

    [ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892423#action_12892423 ]

Scott Carey commented on PIG-1516:

You can avoid finalize() by using a WeakReference. There is no situation in which you can't
substitute a weak reference for a finalizer, other than object resurrection, which is a really
bad idea.

finalize() should be avoided in any case that might create many objects. It's OK to ask the
GC to deal with a small number of objects that themselves don't hold many resources. It's
bad form to use finalize() in any case where throughput is high or the object is potentially
large. It will kill performance and thrash the GC.

Extend WeakReference and put the things you need to clean up in it as member variables. Those
should also have a strong reference from the Bag, and the WeakReference should be strongly
referenced from the Bag too. When the Bag is GC'd, the objects of interest will no longer have
any reference to them other than the WeakReference, and the WeakReference will no longer be
strongly referenced. The WeakReference will be placed onto a ReferenceQueue of your choosing,
and you can then process the queue and access the data required to do any cleanup.
Unlike with a finalizer, the actual object is released when GC happens and does not linger. Only
the WeakReference and what it holds onto remain, and you get notified (via the queue) when
the object is gone. Therefore, you have control over your resources and do not rely on the
JVM to run the finalizer.

I have seen performance improvements of ~10x from moving high-volume finalizers to a weak
reference queue implementation, along with significantly lower memory consumption.
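
A minimal Java sketch of the approach described in the comment above. It is not Pig's actual code: SpillFileCleaner, SpillRef, register and drain are hypothetical names chosen for illustration. It departs from the description in one respect. The reference objects are kept in a set owned by the cleaner rather than being referenced only from the bag, because the java.lang.ref package documentation states that a registered reference which itself becomes unreachable is never enqueued.

```java
import java.io.File;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: deletes a bag's spill files after the bag has been collected. */
class SpillFileCleaner {

    /** Weak reference to the owner that also carries the data needed for cleanup. */
    static final class SpillRef extends WeakReference<Object> {
        // Shared with the owner. This list must not refer back to the owner,
        // otherwise the owner would never become weakly reachable.
        final List<File> spillFiles;

        SpillRef(Object owner, List<File> spillFiles, ReferenceQueue<Object> queue) {
            super(owner, queue);
            this.spillFiles = spillFiles;
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    // Keeps each SpillRef strongly reachable until it has been processed
    // (assumption of this sketch, see the lead-in).
    private final Set<SpillRef> pending = ConcurrentHashMap.newKeySet();

    /** Called by the owner (for example a bag) once it has files to clean up. */
    SpillRef register(Object owner, List<File> spillFiles) {
        SpillRef ref = new SpillRef(owner, spillFiles, queue);
        pending.add(ref);
        return ref;
    }

    /**
     * Processes every reference enqueued so far and deletes the files it holds.
     * Returns the number of files actually deleted from disk.
     */
    int drain() {
        int deleted = 0;
        SpillRef ref;
        while ((ref = (SpillRef) queue.poll()) != null) {
            pending.remove(ref);
            for (File f : ref.spillFiles) {
                if (f.delete()) {
                    deleted++;
                }
            }
        }
        return deleted;
    }
}
```

In this sketch a bag would call register(this, spillFiles) when it creates its first spill file. drain() could be called whenever a new bag registers, or a background thread could use the blocking ReferenceQueue.remove() instead of poll().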

> finalize in bag implementations causes pig to run out of memory in reduce 
> --------------------------------------------------------------------------
>                 Key: PIG-1516
>                 URL: https://issues.apache.org/jira/browse/PIG-1516
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
> *Problem:*
> Pig bag implementations that are subclasses of DefaultAbstractBag have finalize methods
> implemented. As a result, the garbage collector moves them to a finalization queue, and the
> memory used is freed only after finalization has run on them.
> If the bags are not finalized fast enough, a lot of memory is consumed by the finalization
> queue, and Pig runs out of memory. This can happen if a large number of small bags are being
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that are created
> when the bag is too large. But if the bags are small enough, no spill files are created, and
> there is no use for the finalize function.
> A new class that holds a list of files will be introduced (FileList). This class will
> have a finalize method that deletes the files. The bags will no longer have finalize methods,
> and the bags will use FileList instead of ArrayList<File>.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround:
> Disabling the combiner will reduce the number of bags being created, as there will
> be no stage that combines intermediate merge results. But I would recommend disabling
> it only if you have this problem, as it is likely to slow down the query.
> To disable the combiner, set the property: -Dpig.exec.nocombiner=true
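
The FileList idea from the issue description above could look roughly like the following Java sketch. It is an illustration under assumptions, not the committed patch: extending ArrayList<File> and the deleteAll helper are choices made here for the example, and the benefit depends on a bag creating its FileList only when it actually spills.

```java
import java.io.File;
import java.util.ArrayList;

/** Hypothetical sketch of the proposed FileList: a list of spill files that deletes them. */
class FileList extends ArrayList<File> {
    private static final long serialVersionUID = 1L;

    /** Deletes every file in the list; returns how many were actually removed from disk. */
    int deleteAll() {
        int deleted = 0;
        for (File f : this) {
            if (f.delete()) {
                deleted++;
            }
        }
        return deleted;
    }

    // Only FileList instances are finalizable. Assuming a bag creates its FileList
    // lazily, small bags that never spill add nothing to the finalization queue.
    // finalize() is deprecated in newer JDKs; it is used here because the issue proposes it.
    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() throws Throwable {
        try {
            deleteAll();
        } finally {
            super.finalize();
        }
    }
}
```

Compared with the weak reference approach in the comment, this still relies on the JVM running finalizers, but only for bags that have spilled to disk.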

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.