bahir-reviews mailing list archives

From sourav-mazumder <>
Subject [GitHub] bahir issue #28: [BAHIR-75] [WIP] Remote HDFS connector for Apache Spark usi...
Date Wed, 01 Feb 2017 15:33:48 GMT
Github user sourav-mazumder commented on the issue:
    @ckadner Here goes my response to your comments
    > Can you elaborate on differences/limitations/advantages over Hadoop default "webhdfs"
scheme? i.e.
    the main problem you are working around is that the Hadoop WebHdfsFileSystem discards
the Knox gateway path when creating the HTTP URL (principal motivation for this connector), which makes
it impossible to use with Knox
    > the Hadoop WebHdfsFileSystem implements additional interfaces like: 
    This is automatically taken care of by Apache Knox, in my understanding. One of
the key goals of Apache Knox is to relieve Hadoop clients of the nitty-gritty of the internal security
implementation of a Hadoop cluster. So we don't need to handle this in client-level
code if the webhdfs request passes through Apache Knox.
    > performance differences between your approach vs Hadoop's RemoteFS and WebHDFS
    Say a remote Spark cluster needs to read a 2 GB file and spawns
16 connections in parallel to do so. In turn, 16 separate webhdfs calls are made to
the remote HDFS. However, although each call starts reading at a different offset,
the end byte for every call is the end of the file. So the first connection creates an input stream
from byte 0 to the end of the file, the second from 128 MB to the end of the file, the third from
256 MB to the end of the file, and so on. As a result, the amount of data prepared on the
server side for the response, the data transferred over the wire, and the data
read on the client side can be much larger than the original file size (in this
example of a 2 GB file, it can be close to 17 GB). This number
grows further with more connections, and the overhead is
larger still for larger files.
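    The 17 GB figure can be checked with a little arithmetic. The following is a minimal sketch, assuming the file is split into equal chunks and every connection streams from its start offset to the end of the file; the function name is made up for illustration and is not part of the PR.

```python
MB = 1024 * 1024
GB = 1024 * MB


def open_ended_read_volume(file_size: int, connections: int) -> int:
    """Total bytes streamed when each connection reads from its own
    start offset all the way to the end of the file."""
    chunk = file_size // connections  # 128 MB for 2 GB and 16 connections
    return sum(file_size - i * chunk for i in range(connections))


total = open_ended_read_volume(2 * GB, 16)
print(total / GB)  # 17.0
```

    With equal chunks the total works out to file_size * (connections + 1) / 2, so it scales with both the file size and the number of connections.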
    In the approach used in this PR, for the above example, the total volume of data read
and transferred over the wire is always limited to 2 GB plus some extra KBs (for record
boundary resolution). This overhead grows only slightly (still in the KB range)
with more connections, and the growth does not depend on file size. So if a large
file (in GBs) has to be read with a high number of parallel connections, the amount
of data processed on the server side, transferred over the wire, and read on the client side
is always limited to the original file size plus some extra KBs (for record boundary resolution).
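    One way to picture the bounded reads is sketched below. It is an illustration only, not the PR's code: the 64 KB overlap used for record boundary resolution, the host name, and the gateway path are assumptions. The `op=OPEN`, `offset`, and `length` query parameters are part of the WebHDFS REST API.

```python
from urllib.parse import urlencode

MB = 1024 * 1024
GB = 1024 * MB


def bounded_ranges(file_size, connections, overlap=64 * 1024):
    """Yield one (offset, length) pair per connection. Each range runs a
    little past its chunk (assumed overlap) so that a record straddling
    the chunk boundary can be completed by the reader."""
    chunk = -(-file_size // connections)  # ceiling division
    for offset in range(0, file_size, chunk):
        yield offset, min(chunk + overlap, file_size - offset)


def open_url(base, path, offset, length):
    """Build a WebHDFS OPEN request limited to a byte range."""
    query = urlencode({"op": "OPEN", "offset": offset, "length": length})
    return f"{base}/webhdfs/v1{path}?{query}"


ranges = list(bounded_ranges(2 * GB, 16))
total = sum(length for _, length in ranges)
print((total - 2 * GB) // 1024)  # extra KBs read beyond the file size

# Hypothetical Knox gateway URL, for illustration only.
print(open_url("https://knox.example.com:8443/gateway/default",
               "/data/file.csv", *ranges[1]))
```

    Under these assumptions the overhead is (connections - 1) * overlap, which depends on the number of connections but not on the file size.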
    > Configuration
    Some configuration parameters are specific to remote servers and should be specified
per server, not at connector level (some at server level may override connector level), i.e.
    Server level: 
    gateway path (assuming one Knox gateway per server)
    user name and password
    authentication method (think Kerberos etc)
    Connector level: 
    certificate validation options (maybe overridden by server level props)
    trustStore path
    webhdfs protocol version (maybe overridden by server level props)
    buffer sizes, file chunk sizes, retry intervals, etc.
    You are right. However, I would name the two levels Server Level and File Level. Some
parameters won't change from file to file; they are specific to a remote HDFS server and
are therefore Server Level parameters. The values of other parameters can differ from
file to file; these are File Level parameters. The Server Level parameters are Gateway Path,
User Name/Password, WebHDFS protocol version, and Certificate Validation option (and other parameters
associated with it). The File Level parameters are buffer sizes, file chunk sizes,
etc., which can differ from file to file.
    I don't see a need for any property at connector level (that is, parameters which would
be the same across all remote HDFS servers accessed by the connector). All properties here
relate either to how the remote HDFS server is set up or to the type of
file being accessed. Let me know if I'm missing any aspect here.
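    The Server Level / File Level split could be modelled roughly as follows. This is a sketch for discussion only; the class names, field names, and default values are hypothetical and do not come from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerConfig:
    """Settings tied to one remote HDFS server (hypothetical fields)."""
    gateway_path: str
    user: str
    password: str
    webhdfs_version: str = "v1"
    validate_certificate: bool = True


@dataclass(frozen=True)
class FileConfig:
    """Settings that may differ from file to file (hypothetical fields)."""
    buffer_size: int = 64 * 1024
    chunk_size: int = 128 * 1024 * 1024


# One server configuration is shared by every file read from that server.
server = ServerConfig(gateway_path="gateway/default",
                      user="alice", password="secret")

# Per-file tuning without touching the server settings.
small_file = FileConfig(chunk_size=16 * 1024 * 1024)
large_file = FileConfig()
```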
    > Usability
    Given that users need to know about the remote Hadoop server configuration (security,
gateway path, etc) for WebHDFS access would it be nicer if ...
    users could separately configure server specific properties in a config file or registry
    and then in Spark jobs only use :/ without having to provide additional properties
    That's a good idea. We can have a set of default values for these parameters based on
typical practice/convention. However, those default values can be overridden if the user
specifies them.
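    A layered lookup is one simple way to get "defaults unless the user says otherwise". The sketch below assumes an in-memory registry keyed by host name; the registry, the keys, and the default values are all hypothetical.

```python
from collections import ChainMap

# Convention-based defaults (hypothetical values).
DEFAULTS = {
    "gateway_path": "gateway/default",
    "webhdfs_version": "v1",
    "validate_certificate": True,
}

# Server-specific properties, e.g. loaded from a config file or registry.
SERVER_REGISTRY = {
    "knox.example.com": {"gateway_path": "gateway/prod"},
}


def resolve(host, job_options=None):
    """Merge settings: per-job options win over registry entries,
    which win over the defaults."""
    return dict(ChainMap(job_options or {},
                         SERVER_REGISTRY.get(host, {}),
                         DEFAULTS))


print(resolve("knox.example.com")["gateway_path"])
print(resolve("unknown.example.com")["gateway_path"])
```

    A Spark job that names only the server would then pick up the registered gateway path, and anything it passes explicitly still takes precedence.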
    > Security
    what authentication methods are supported besides basic auth (i.e. OAuth, Kerberos, ...)
    Right now this PR supports basic auth at the Knox gateway level. Other authentication
mechanisms supported by Apache Knox (SAML, OAuth, CAS, OpenID) are not supported yet.
    Apache Knox complements Kerberized Hadoop clusters (where nodes communicate
and authenticate among themselves). That is taken care of by Apache Knox transparently
through the relevant configuration.
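    For reference, HTTP basic auth amounts to sending one header built from the user name and password. The helper below is a generic sketch of that header, not the connector's code; the function name is made up.

```python
import base64


def basic_auth_header(user: str, password: str) -> dict:
    """Build an HTTP Basic 'Authorization' header for the given credentials."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


print(basic_auth_header("Aladdin", "open sesame"))
# {'Authorization': 'Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=='}
```

    Since the credentials are only base64-encoded, not encrypted, this is only reasonable over HTTPS.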
    > should the connector manage auth tokens, token renewal, etc
    No. It is internally handled by Apache Knox. 
    > I don't think the connector should create a truststore, either skip certificate validation
or take a user provided truststore path (btw, the current code fails to create a truststore
on Mac OS X)
    On second thought, I agree with you.
    > Debugging
    the code should have logging at INFO, DEBUG, ERROR levels using the Spark logging mechanisms
(targeting the Spark log files)
    > Testing
    The outstanding unit tests should verify that the connector works with a ...
    standard Hadoop cluster (unsecured)
    This PR focuses only on secured Hadoop clusters. An unsecured Hadoop cluster can be accessed
using the existing webhdfs client library available from Hadoop, so we don't need this.
    > Hadoop clusters secured by Apache Knox
    > Hadoop clusters secured by other mechanisms like Kerberos
    We don't need to, as that is more a feature of Apache Knox.
