Mailing-List: contact dev-help@spark.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@spark.apache.org
From: tdas <git@git.apache.org>
To: dev@spark.apache.org
References: <git-pr-126-spark@git.apache.org>
In-Reply-To: <git-pr-126-spark@git.apache.org>
Subject: [GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage
 collectio...
Content-Type: text/plain
Message-Id: <20140313002529.D17DB943417@tyr.zones.apache.org>
Date: Thu, 13 Mar 2014 00:25:29 +0000 (UTC)

Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/126#issuecomment-37485786
  
    @yaoshengzhe I agree that using a finalizer is not ideal. However, the problem we are dealing with here is that there is no clean and safe way to detect whether an RDD or a shuffle has gone out of scope, other than using the garbage collection mechanisms. There already exist mechanisms like RDD.unpersist() to clean up persisted RDDs. That works as long as the developer diligently keeps track of all RDDs and makes sure to unpersist them while keeping track of dependencies. That's a pain, just like malloc and free. Similarly, for the shuffle data (map outputs), it's hard to figure out when all the RDDs that depend on the shuffle data have gone out of scope, so it's hard to figure out when it is safe to clean up the shuffle data. Furthermore, if you consider RDD checkpointing, which transparently modifies the RDD DAG structure behind the scenes, it gets even harder to keep track of RDDs and clean them. So the only safe way is to use the Java garbage collection mechanism.
    
    One can argue that this functionality could be implemented without finalize() by using weak references and reference queues (reference queues keep track of which objects got garbage collected). However, that requires all RDDs, etc. to be wrapped with WeakReference objects. That's a much more complicated and error-prone solution. Hence, I have used finalize() for now. As @andrewor14 has already pointed out, I have taken care to make the finalize() function as cheap as possible (just an insert into a queue). And regarding what the article says about object initialization taking longer if a finalize() function is defined, I think it is an acceptable overhead (a few ms), as RDDs are not created at the rate of 1000s per second.
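
For reference, here is a minimal sketch of the general shape the weak-reference and reference-queue alternative could take. It is not code from this pull request: `CleanupTracker`, `TrackedRef`, `register`, and `drain` are hypothetical names, and the `int` resource id merely stands in for something like an RDD or shuffle id. It assumes only the standard `java.lang.ref` API.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names and structure are assumptions, not Spark's API.
public class CleanupTracker {

    /** Weak reference that remembers which resource id belongs to the tracked owner. */
    public static final class TrackedRef extends WeakReference<Object> {
        final int resourceId;

        TrackedRef(Object owner, int resourceId, ReferenceQueue<Object> queue) {
            super(owner, queue);
            this.resourceId = resourceId;
        }
    }

    // The JVM places a TrackedRef on this queue after its owner becomes unreachable.
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    // Strong references to the TrackedRef objects themselves; without these the
    // references could be collected before they are ever enqueued.
    private final Set<TrackedRef> pending = ConcurrentHashMap.newKeySet();

    /** Start tracking an owner; only a weak reference to it is kept. */
    public TrackedRef register(Object owner, int resourceId) {
        TrackedRef ref = new TrackedRef(owner, resourceId, queue);
        pending.add(ref);
        return ref;
    }

    /** Return the ids whose owners have been collected since the last call. */
    public List<Integer> drain() {
        List<Integer> ready = new ArrayList<>();
        Reference<?> r;
        while ((r = queue.poll()) != null) {
            TrackedRef tracked = (TrackedRef) r;
            pending.remove(tracked);
            ready.add(tracked.resourceId);
        }
        return ready;
    }
}
```

In a sketch like this, a background thread would call `drain()` periodically and run the actual cleanup for each returned id. When an owner is collected, and therefore when its id shows up, depends on the garbage collector and is not deterministic.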
    
