spark-reviews mailing list archives

From cmccabe <...@git.apache.org>
Subject [GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Date Fri, 18 Jul 2014 19:45:45 GMT
Github user cmccabe commented on the pull request:

    https://github.com/apache/spark/pull/1486#issuecomment-49472027
  
    So, ideally we'd be able to set a different TaskLocality based on whether the
    replica is cached or not.  Right now, getPreferredLocations just returns a string,
    which makes this difficult to do.
    
    It seems like we have a few choices:
    1. Simply reorder the replicas, as this change does (the disadvantage is that we
    lose some locality information).
    2. Change getPreferredLocations to return a type containing (hostname, Locality)
    rather than simply a string.
    3. getPreferredLocations could continue to return strings, but we could add
    "cached:" to the front of some of them.
    4. Add a new function to RDD which, when available, would be used to return this
    information.
    
    This patch is choice 1.
    
    Choice 2 might have some backwards compatibility issues.
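
    For concreteness, choice 2 might look roughly like the sketch below (the names
    PreferredLocation and CacheLocality are purely illustrative, not existing Spark
    API):

        // Hypothetical structured location type.  Today getPreferredLocations
        // returns Seq[String], so changing its return type would force every
        // existing override to change, which is the compatibility concern.
        object CacheLocality extends Enumeration {
          val HdfsCached, Uncached = Value
        }

        case class PreferredLocation(host: String, locality: CacheLocality.Value)

        // An RDD could then report, e.g.:
        //   Seq(PreferredLocation("host1", CacheLocality.HdfsCached),
        //       PreferredLocation("host2", CacheLocality.Uncached))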
    
    Choice 3 is a bit ugly, but is clearly the simplest.  Since colons are not valid
    characters in hostnames, it seems safe as well.
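
    A minimal sketch of how the encoding and decoding for choice 3 could work (the
    "cached:" prefix handling and the helper names here are just an illustration):

        // Encode cache locality into the existing location strings, relying on
        // the fact that ':' can never appear in a hostname.
        object CachedLocationCodec {
          private val Prefix = "cached:"

          // Mark a host whose replica is in the HDFS cache.
          def encode(host: String, cached: Boolean): String =
            if (cached) Prefix + host else host

          // Recover (host, isCached) on the scheduler side.
          def decode(location: String): (String, Boolean) =
            if (location.startsWith(Prefix)) (location.stripPrefix(Prefix), true)
            else (location, false)
        }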
    
    Choice 4 is a bit trickier: if any code fails to implement the new function, we
    fall back to not knowing about cache locality, which isn't ideal.
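
    One way choice 4 could be expressed, purely as an illustration (the trait and
    method names below are stand-ins, not the real RDD API):

        // Hypothetical optional hook that RDD implementations may override.
        trait CacheAwareLocations {
          // None means "not implemented"; the scheduler would then fall back to
          // the plain getPreferredLocations strings, with no cache information.
          def getCachedPreferredLocations(partitionIndex: Int): Option[Seq[String]] = None
        }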
    
    Thoughts?

