From: tdas
To: dev@spark.apache.org
Reply-To: dev@spark.apache.org
Subject: [GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...
Content-Type: text/plain
Message-Id: <20140313002529.D17DB943417@tyr.zones.apache.org>
Date: Thu, 13 Mar 2014 00:25:29 +0000 (UTC)

Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/126#issuecomment-37485786

@yaoshengzhe I agree that using a finalizer is not the ideal thing in the world. However, the problem we are dealing with here is that there is no clean and safe way to detect whether an RDD or a shuffle has gone out of scope, other than using the garbage collection mechanism. There already exist mechanisms like RDD.unpersist() to clean up persisted RDDs. That works only as long as the developer diligently keeps track of all RDDs and makes sure to unpersist them while keeping track of dependencies.
That's a pain, just like malloc and free. Similarly, for the shuffle data (map outputs), it's hard to figure out when all the RDDs that depend on the shuffle data have gone out of scope, so it's hard to figure out when it is safe to clean up the shuffle data. Furthermore, if you consider RDD checkpointing, which transparently modifies the RDD DAG structure behind the scenes, it gets even harder to keep track of RDDs and clean them up. So the only safe way is to use the Java garbage collection mechanism.

One can argue that this functionality could be implemented without finalize(), by using weak references and reference queues (reference queues keep track of which objects got garbage collected). However, that requires all RDDs, etc. to be wrapped with WeakReference objects. That's a much more complicated and error-prone solution. Hence, I have used finalize() for now.

As @andrewor14 has already pointed out, I have taken care to make the finalize() method as cheap as possible (just an insert into a queue). As for what the article says about object initialization taking longer if a finalize() method is defined, I think it is an acceptable overhead (a few ms), as RDDs are not created at a rate of thousands per second.
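
For reference, the weak-reference alternative discussed above could look roughly like the following. This is only a minimal sketch of the general WeakReference/ReferenceQueue technique, not code from the pull request: `CleanupTracker`, `TrackedRef`, `register`, `drain` and the integer resource id are all hypothetical names chosen for illustration.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker; all names here are illustrative, not from the pull request. */
class CleanupTracker {

    /** Weak reference that remembers the id of the resource to clean up. */
    static final class TrackedRef extends WeakReference<Object> {
        final int resourceId;

        TrackedRef(Object referent, int resourceId, ReferenceQueue<Object> queue) {
            super(referent, queue);
            this.resourceId = resourceId;
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    // The reference objects themselves must stay strongly reachable;
    // otherwise they could be collected before they are ever enqueued.
    private final Set<TrackedRef> live = ConcurrentHashMap.newKeySet();

    /** Start tracking `owner`; only a weak reference to it is kept. */
    TrackedRef register(Object owner, int resourceId) {
        TrackedRef ref = new TrackedRef(owner, resourceId, queue);
        live.add(ref);
        return ref;
    }

    /** Non-blocking: returns the ids of all references enqueued since the last call. */
    List<Integer> drain() {
        List<Integer> ids = new ArrayList<>();
        Reference<?> r;
        while ((r = queue.poll()) != null) {
            TrackedRef ref = (TrackedRef) r;
            live.remove(ref);
            ids.add(ref.resourceId);
        }
        return ids;
    }
}
```

In this sketch the garbage collector enqueues a `TrackedRef` some time after its owner becomes unreachable, and a cleanup thread would call `drain()` (or block on the queue) and release the resources for the returned ids. The extra bookkeeping, such as keeping the reference objects alive and mapping them back to resources, is the kind of added complexity the comment refers to.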