spark-reviews mailing list archives

From steveloughran <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-5682] Reuse hadoop encrypted shuffle al...
Date Thu, 19 Mar 2015 14:55:29 GMT
Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/4491#issuecomment-83615286
  
    1. The `InterfaceAudience.Private` tags in Hadoop are a "please don't use" hint, although
if you look at YARN AMs, they end up importing & using stuff which is tagged that way;
you can't currently write an AM without it. What the tag does mean is: these classes may be
changed, even unintentionally, in both signatures and semantics, and if they break your code, it's
your responsibility to find that out and complain before the next Hadoop release ships. Summary: test
against Hadoop trunk or at least beta releases.
    
    2. The crypto code is still encountering a few stabilisation problems related to multithreading,
stuff that doesn't show up in the unit tests. The code in 2.6 has already been supplanted by
the code in branch-2/trunk. Forking off your own code means tracking those changes and keeping
in sync; keeping the code in Java would aid diffing and cherry-picking there. Even without
trying to handle the quirks of the extended Hadoop streams, concurrency issues like [HADOOP-11710](https://issues.apache.org/jira/browse/HADOOP-11710)
may matter.
    
    3. There's also the problem that encryption performance comes from native binaries, which
means that YARN deployments must either bind to the hadoop.so/.dll on the PATH, or push up a new
version & extend PATH in container launch contexts, while other deployments must come up with
new solutions. If you can stick to JCE routines (as this patch does) life may be simpler.
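
As an illustration of the "stick to JCE routines" option in point 3, here is a minimal sketch that encrypts and decrypts a byte stream using only classes shipped with the JDK, so no native library has to be on the PATH. It is not the code from this pull request: the class name `JceStreamSketch`, its helper methods, and the choice of `AES/CTR/NoPadding` with a 128-bit key are assumptions made for the example, and key distribution is left out entirely.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper for illustration only; not part of Spark or Hadoop.
class JceStreamSketch {
    // Assumed transformation for this sketch; a stream mode, so no padding is added.
    static final String TRANSFORMATION = "AES/CTR/NoPadding";

    // Builds a fresh Cipher per stream. Cipher instances are stateful,
    // so they should not be shared between threads or streams.
    static Cipher newCipher(int mode, byte[] key, byte[] iv) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("JCE provider rejected " + TRANSFORMATION, e);
        }
    }

    // Writes the plaintext through a CipherOutputStream and returns the ciphertext.
    static byte[] encrypt(byte[] plain, byte[] key, byte[] iv) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out =
                 new CipherOutputStream(sink, newCipher(Cipher.ENCRYPT_MODE, key, iv))) {
            out.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Reads the ciphertext back through a CipherInputStream and returns the plaintext.
    static byte[] decrypt(byte[] ciphertext, byte[] key, byte[] iv) {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (InputStream in = new CipherInputStream(
                 new ByteArrayInputStream(ciphertext),
                 newCipher(Cipher.DECRYPT_MODE, key, iv))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                plain.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return plain.toByteArray();
    }

    public static void main(String[] args) {
        SecureRandom random = new SecureRandom();
        byte[] key = new byte[16]; // 128-bit AES key
        byte[] iv = new byte[16];  // must be fresh for every stream under the same key
        random.nextBytes(key);
        random.nextBytes(iv);

        byte[] block = "shuffle block".getBytes(StandardCharsets.UTF_8);
        byte[] restored = decrypt(encrypt(block, key, iv), key, iv);
        System.out.println("round trip ok: " + Arrays.equals(block, restored));
    }
}
```

The trade-off is the one described above: pure-JCE code avoids the library path setup, but it does not get the throughput of the native binaries.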
    
    A standalone security JAR + library would be better, with code shared by both Hadoop &
other apps. You could talk to the Hadoop project about isolating it in Hadoop itself, though
that would imply a separate native build & lib, etc.
    
    The other tactic is to make the shuffle mechanism more pluggable, and on YARN clusters
switch to an encrypted shuffle provided by a separate library, or use the YARN NM via whatever
extension points need to be added. The latter tactic will avoid any native library path setup
issues, and will allow alternative deployments (standalone, Mesos) to switch to an encrypted
shuffle later.
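
To make the "more pluggable" tactic concrete, here is a rough sketch of what such an extension point could look like: the shuffle code talks to an interface, and the deployment picks an implementation by name. Everything here is hypothetical: `ShuffleStreamCodec`, `PlainCodec`, `JceCodec`, `ShuffleCodecs`, and the names `"plain"` and `"jce-aes-ctr"` are invented for the example and are not Spark, Hadoop, or YARN APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical extension point: the shuffle only ever sees wrapped streams.
interface ShuffleStreamCodec {
    OutputStream wrapForWrite(OutputStream raw) throws IOException;
    InputStream wrapForRead(InputStream raw) throws IOException;
}

// Default implementation: passes bytes through unchanged.
final class PlainCodec implements ShuffleStreamCodec {
    public OutputStream wrapForWrite(OutputStream raw) { return raw; }
    public InputStream wrapForRead(InputStream raw) { return raw; }
}

// JCE-only implementation; a separate library could supply a native-backed one.
final class JceCodec implements ShuffleStreamCodec {
    private final SecretKeySpec key;
    private final IvParameterSpec iv;

    JceCodec(byte[] key, byte[] iv) {
        this.key = new SecretKeySpec(key, "AES");
        // Simplification: real code needs a fresh IV for every stream.
        this.iv = new IvParameterSpec(iv);
    }

    private Cipher cipher(int mode) throws IOException {
        try {
            Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
            c.init(mode, key, iv);
            return c;
        } catch (GeneralSecurityException e) {
            throw new IOException("cannot initialise cipher", e);
        }
    }

    public OutputStream wrapForWrite(OutputStream raw) throws IOException {
        return new CipherOutputStream(raw, cipher(Cipher.ENCRYPT_MODE));
    }

    public InputStream wrapForRead(InputStream raw) throws IOException {
        return new CipherInputStream(raw, cipher(Cipher.DECRYPT_MODE));
    }
}

// Selects a codec by name, as a deployment-specific setting might.
final class ShuffleCodecs {
    static ShuffleStreamCodec forName(String name, byte[] key, byte[] iv) {
        switch (name) {
            case "plain": return new PlainCodec();
            case "jce-aes-ctr": return new JceCodec(key, iv);
            default: throw new IllegalArgumentException("unknown codec: " + name);
        }
    }

    // Writes data through the codec and returns the bytes that would hit disk or network.
    static byte[] write(ShuffleStreamCodec codec, byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = codec.wrapForWrite(sink)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Reads stored bytes back through the codec and returns the original data.
    static byte[] read(ShuffleStreamCodec codec, byte[] stored) {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        try (InputStream in = codec.wrapForRead(new ByteArrayInputStream(stored))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                data.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return data.toByteArray();
    }
}
```

With this shape, a call such as `ShuffleCodecs.forName("jce-aes-ctr", key, iv)` is the only place that knows which implementation is in use, so a standalone or Mesos deployment could start with `"plain"` and switch to an encrypted codec later without touching the shuffle code.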
    




