From: ckadner
To: reviews@bahir.apache.org
Reply-To: reviews@bahir.apache.org
Subject: [GitHub] bahir issue #28: [BAHIR-75] [WIP] Remote HDFS connector for Apache Spark usi...
Date: Mon, 23 Jan 2017 23:36:10 +0000 (UTC)

Github user ckadner commented on the issue:

    https://github.com/apache/bahir/pull/28

A few high-level questions before jumping into a more detailed code review:

**Design**

Can you elaborate on the differences/limitations/advantages over Hadoop's default "webhdfs" scheme? I.e.:
- the main problem you are working around is that the Hadoop `WebHdfsFileSystem` discards the Knox gateway path when creating the HTTP URL (the principal motivation for this connector), which makes it impossible to use with Knox
- the Hadoop `WebHdfsFileSystem` implements additional interfaces like:
  - `DelegationTokenRenewer.Renewable`
  - `TokenAspect.TokenManagementDelegator`
- performance differences between your approach and Hadoop's _RemoteFS_ and _WebHDFS_

**Configuration**

Some configuration parameters are specific to the remote servers and should be specified per server, not at the connector level (some server-level settings may override connector-level ones), i.e.:

- Server level:
  - gateway path (assuming one Knox gateway per server)
  - user name and password
  - authentication method (think Kerberos etc.)
- Connector level:
  - certificate validation options (maybe overridden by server-level props)
  - trustStore path
  - webhdfs protocol version (maybe overridden by server-level props)
  - buffer sizes, file chunk sizes, retry intervals, etc.

**Usability**

Given that users need to know about the remote Hadoop server configuration (security, gateway path, etc.) for WebHDFS access, would it be nicer if ...

- users could separately configure server-specific properties in a config file or registry object
- and then in Spark jobs only use `<scheme>://<path>` without having to provide additional properties

**Security**

- what authentication methods are supported besides basic auth (i.e. OAuth, Kerberos, ...)?
- should the connector manage auth tokens, token renewal, etc.?
- I don't think the connector should create a truststore; either skip certificate validation or take a user-provided truststore path (btw, the current code fails to create a truststore on Mac OS X)

**Debugging**

- the code should have logging at INFO, DEBUG, and ERROR levels using the Spark logging mechanisms (targeting the Spark log files)

**Testing**

The outstanding unit tests should verify that the connector works with a ...
- standard Hadoop cluster (unsecured)
- Hadoop cluster secured by Apache Knox
- Hadoop cluster secured by other mechanisms like Kerberos

---

If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes to enable it, or if the feature is enabled but not working, please contact infrastructure at infrastructure@apache.org or file a JIRA ticket with INFRA.

---
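As an illustration of the **Configuration**/**Usability** points raised above (server-specific properties configured once in a registry object, so Spark jobs only pass a plain URL), a minimal sketch could look like the following. All class, method, and property names here are invented for illustration; none of them exist in the PR:

```scala
// Hypothetical sketch of the "config file or registry object" idea:
// server-level settings are registered once per remote Hadoop/Knox server,
// and the connector resolves them by host name at job time.
import scala.collection.mutable

// Server-level settings (one entry per remote server).
final case class RemoteHdfsServer(
    gatewayPath: String,                   // Knox gateway path, e.g. "gateway/default"
    user: String,                          // basic-auth user name
    authMethod: String = "basic",          // could grow to "kerberos", "oauth", ...
    trustStorePath: Option[String] = None) // user-provided; connector should not create one

object RemoteHdfsRegistry {
  private val servers = mutable.Map.empty[String, RemoteHdfsServer]

  def register(host: String, cfg: RemoteHdfsServer): Unit =
    servers(host) = cfg

  def lookup(host: String): Option[RemoteHdfsServer] =
    servers.get(host)
}

// Configure once, outside the Spark job ...
RemoteHdfsRegistry.register(
  "knox.example.com",
  RemoteHdfsServer(gatewayPath = "gateway/default", user = "alice"))

// ... so a Spark job would only need a plain URL such as
// webhdfs://knox.example.com/user/alice/data.csv, and the connector
// could look up gateway path, credentials, and truststore by host:
println(RemoteHdfsRegistry.lookup("knox.example.com").map(_.gatewayPath))
// prints Some(gateway/default)
```

This would also give server-level properties a natural place to override connector-level defaults (e.g. a per-server `trustStorePath` taking precedence over a global one), as suggested in the Configuration section.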