spark-dev mailing list archives

From tdas <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...
Date Thu, 13 Mar 2014 00:25:29 GMT
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/126#issuecomment-37485786
  
    @yaoshengzhe I agree that using a finalizer is not ideal. However, the problem we are
dealing with here is that there is no clean and safe way to detect whether an RDD or a
shuffle has gone out of scope other than the garbage collection mechanisms. Mechanisms
like RDD.unpersist() already exist to clean up persisted RDDs, but they work only as long
as the developer diligently keeps track of all RDDs and makes sure to unpersist them while
keeping track of dependencies. That's a pain, just like malloc and free. Similarly, for the
shuffle data (map outputs), it's hard to figure out when all the RDDs that depend on the
shuffle data have gone out of scope, so it's hard to figure out when it is safe to clean up
the shuffle data. Furthermore, if you consider RDD checkpointing, which transparently
modifies the RDD DAG structure behind the scenes, it gets even harder to keep track of RDDs
and clean them. So the only safe way is to use the Java garbage collection mechanism.
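
    To make the idea concrete, here is a minimal sketch of GC-driven detection with a
finalizer that does nothing except record an ID in a queue. This is not the code in the
pull request: TrackedResource, CLEANUP_QUEUE, enqueueForCleanup() and drainPending() are
hypothetical names used only for illustration, and a separate cleaner thread is assumed
to do the real work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for an RDD or shuffle handle; not Spark's actual class.
class TrackedResource {
    // IDs of objects the garbage collector has finalized.
    // A separate cleaner thread (not shown) is assumed to drain this queue.
    static final ConcurrentLinkedQueue<Integer> CLEANUP_QUEUE = new ConcurrentLinkedQueue<>();

    private final int id;

    TrackedResource(int id) {
        this.id = id;
    }

    // The only work done at finalization time: a cheap, non-blocking enqueue.
    void enqueueForCleanup() {
        CLEANUP_QUEUE.add(id);
    }

    // Called by the JVM some time after the object becomes unreachable.
    // Object.finalize() is deprecated in recent Java versions, hence the suppression.
    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() {
        enqueueForCleanup();
    }

    // What the assumed cleaner thread would call: take every pending ID off the queue.
    static List<Integer> drainPending() {
        List<Integer> ids = new ArrayList<>();
        Integer next;
        while ((next = CLEANUP_QUEUE.poll()) != null) {
            ids.add(next);
        }
        return ids;
    }
}
```

    The JVM gives no guarantee about when, or even whether, finalize() runs, so in a
design like this the cleanup can only ever be best-effort.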
    
    One can argue that this functionality can be implemented without finalizer() by using
weak references and reference queues (reference queues keep track of which objects got
garbage collected). However, that requires all RDDs, etc. to be wrapped in WeakReference
objects. That's a much more complicated and error-prone solution. Hence, I have used
finalizer() for now. As @andrewor14 has already pointed out, I have taken care to make the
finalizer() function as cheap as possible (just an insert into a queue). And regarding what
the article says about object initialization taking longer if a finalize() function is
defined, I think it is an acceptable overhead (a few ms), as RDDs are not created at the
rate of 1000s per second.
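
    For comparison, the weak-reference alternative could look roughly like the sketch
below. It uses only the standard java.lang.ref classes; WeakCleanupTracker, IdRef,
register(), drainCollected() and trackedCount() are hypothetical names, not anything in
Spark. Note the extra bookkeeping: every tracked object needs its own reference object,
and those reference objects must themselves be kept strongly reachable until they are
enqueued.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker illustrating WeakReference + ReferenceQueue; not Spark's API.
class WeakCleanupTracker {
    // A weak reference that remembers the ID of its referent, because the
    // referent itself is no longer available once the reference is enqueued.
    static final class IdRef extends WeakReference<Object> {
        final int id;

        IdRef(Object referent, int id, ReferenceQueue<Object> queue) {
            super(referent, queue);
            this.id = id;
        }
    }

    // The garbage collector enqueues an IdRef here after its referent is collected.
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    // Hold the reference objects strongly; otherwise they could be collected
    // themselves and would never be enqueued.
    private final Map<Integer, IdRef> refs = new ConcurrentHashMap<>();

    // Starts tracking a resource. The IdRef is returned only for illustration.
    IdRef register(Object resource, int id) {
        IdRef ref = new IdRef(resource, id, queue);
        refs.put(id, ref);
        return ref;
    }

    // Non-blocking: returns the IDs of all resources found in the queue so far.
    List<Integer> drainCollected() {
        List<Integer> ids = new ArrayList<>();
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            IdRef idRef = (IdRef) ref;
            refs.remove(idRef.id);
            ids.add(idRef.id);
        }
        return ids;
    }

    int trackedCount() {
        return refs.size();
    }
}
```

    A cleaner thread would call drainCollected() periodically (or block on
ReferenceQueue.remove()) and release whatever is associated with each returned ID.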
    



