hive-dev mailing list archives

From Andrew Lee <alee...@hotmail.com>
Subject Re: Hard Coded 0 to assign RPC Server port number when hive.execution.engine=spark
Date Wed, 21 Oct 2015 17:31:01 GMT
Hi Xuefu,

I have created https://issues.apache.org/jira/browse/HIVE-12222.

Please advise whether the subject and the fields are appropriate, and feel free to update
them to better match the community's conventions. I'll follow up in that JIRA ticket for
discussion, thanks.

________________________________________
From: Andrew Lee
Sent: Wednesday, October 21, 2015 10:25 AM
To: dev@hive.apache.org
Subject: Re: Hard Coded 0 to assign RPC Server port number when hive.execution.engine=spark

Hi Xuefu,

Thanks, I'll create a JIRA. By the way, since HiveCLI will eventually be replaced by Beeline
or another design, I'm hoping the same philosophy can be applied if another CLI uses RpcServer
as well or shares the same source code at some point.

Should the Issue Type of the JIRA ticket be "Improvement" or "New Feature"?

________________________________________
From: Xuefu Zhang <xzhang@cloudera.com>
Sent: Tuesday, October 20, 2015 6:39 PM
To: dev@hive.apache.org
Subject: Re: Hard Coded 0 to assign RPC Server port number when hive.execution.engine=spark

Thanks, Andrew! You have a point. However, we're trying to sunset Hive CLI.
In the meantime, I guess it doesn't hurt to give admins more control over
the ports to be used. Please put your proposal in a JIRA and we can go from
there.

--Xuefu

On Tue, Oct 20, 2015 at 7:54 AM, Andrew Lee <alee526@hotmail.com> wrote:

> Hi Xuefu,
>
> Two main reasons.
>
> - Most users (from what I see and encounter) use HiveCLI as a command-line
> tool, and to use it they need to log in to the edge node (via SSH).
> This may or may not hold everywhere, but it is what I observe from time
> to time: most users abuse the resources on that edge node (increasing
> HADOOP_HEAPSIZE, dumping output to local disk, running huge Python
> workflows, etc.). This may cause an HS2 process on the same node to run
> into OOME, choke, and die, along with other resource issues such as
> login problems.
>
> - Analysts connect to Hive via HS2 + ODBC, so HS2 needs to be highly
> available. It makes sense to run it on the gateway node or a service node,
> separated from HiveCLI.
> The logs are in a different location, and monitoring and auditing are
> easier when HS2 runs under a daemon user account, so we don't want users
> to run HiveCLI where HS2 is running.
> It's better to isolate the resources this way to avoid memory, file
> handle, and disk space issues.
>
> From a security standpoint,
>
> - Since users can log in to the edge node (via SSH), security on the edge
> node needs to be hardened. That is where the firewall (FW) and auditing
> come in.
>
> - Regulation/compliance auditing is another requirement: all traffic must
> be monitored. Specifying and locking down the ports makes this easier,
> since we can focus on a known range to monitor and audit.
>
> Hope this explains the reason why we are asking for this feature.
>
>
> ________________________________________
> From: Xuefu Zhang <xzhang@cloudera.com>
> Sent: Monday, October 19, 2015 9:37 PM
> To: dev@hive.apache.org
> Subject: Re: Hard Coded 0 to assign RPC Server port number when
> hive.execution.engine=spark
>
> Hi Andrew,
>
> I understand your policy on the edge node. However, I'm wondering why you
> cannot require that Hive CLI run only on gateway nodes, similar to HS2? In
> essence, Hive CLI is a client with an embedded Hive server, so it seems
> reasonable to have a requirement similar to the one for HS2.
>
> I'm not defending against your request. Rather, I'm interested in the
> rationale behind your policy.
>
> Thanks,
> Xuefu
>
> On Mon, Oct 19, 2015 at 9:12 PM, Andrew Lee <alee526@hotmail.com> wrote:
>
> > Hi Xuefu,
> >
> > I agree for HS2, since HS2 usually runs on a gateway or service node inside
> > the cluster environment.
> > In my case, it is actually additional security.
> > A separate edge node (not running HS2; HS2 runs on another box) is used
> > for HiveCLI.
> > We don't allow data/worker nodes to talk to the edge node on random ports.
> > All ports must be registered or explicitly specified and monitored.
> > That's why I am asking for this feature. Otherwise, opening up 1024-65535
> > from the data/worker nodes to the edge node is actually
> > a bad idea and bad practice for network security.  :(
> >
> >
> >
> > ________________________________________
> > From: Xuefu Zhang <xzhang@cloudera.com>
> > Sent: Monday, October 19, 2015 1:12 PM
> > To: dev@hive.apache.org
> > Subject: Re: Hard Coded 0 to assign RPC Server port number when
> > hive.execution.engine=spark
> >
> > Hi Andrew,
> >
> > RpcServer is an instance launched for each user session. In the case of Hive
> > CLI, which is for a single user, what you said makes sense and the port
> > number can be configurable. In the context of HS2, however, there are
> > multiple user sessions and the total is unknown in advance. While the +1
> > scheme works, there can still be a band of ports that might eventually be
> > opened.
> >
> > From a different perspective, we expect that either Hive CLI or HS2 resides
> > on a gateway node, which is in the same network as the data/worker
> > nodes. In this configuration, the firewall issue you mentioned doesn't apply.
> > Such a configuration is what we usually see with our enterprise customers,
> > and it is what we recommend. I'm not sure why you would want your Hive
> > users to launch Hive CLI anywhere outside your cluster, which doesn't seem
> > secure if security is your concern.
> >
> > Thanks,
> > Xuefu
> >
> > On Mon, Oct 19, 2015 at 7:20 AM, Andrew Lee <alee526@hotmail.com> wrote:
> >
> > > Hi All,
> > >
> > >
> > > I notice that in
> > >
> > > ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> > >
> > > the port number is assigned 0, which means a random port is used
> > > every time the RPC server is created to talk to Spark in the same session.
> > >
> > > Is there any reason why this port number is not a configurable property
> > > that follows the same +1 rule if the port is taken, just like Spark's
> > > configuration for the Spark driver, etc.? Because the port numbers are
> > > unpredictable, it is difficult to configure a firewall between the
> > > HiveCLI RPC server and Spark. In other words, users need to open the
> > > entire range of possible Hive ports
> > > from Data Node => HiveCLI (edge node).
> > >
> > >
> > >
> > >
> > >  this.channel = new ServerBootstrap()
> > >       .group(group)
> > >       .channel(NioServerSocketChannel.class)
> > >       .childHandler(new ChannelInitializer<SocketChannel>() {
> > >           @Override
> > >           public void initChannel(SocketChannel ch) throws Exception {
> > >             SaslServerHandler saslHandler = new SaslServerHandler(config);
> > >             final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, group);
> > >             saslHandler.rpc = newRpc;
> > >
> > >             Runnable cancelTask = new Runnable() {
> > >                 @Override
> > >                 public void run() {
> > >                   LOG.warn("Timed out waiting for hello from client.");
> > >                   newRpc.close();
> > >                 }
> > >             };
> > >             saslHandler.cancelTask = group.schedule(cancelTask,
> > >                 RpcServer.this.config.getServerConnectTimeoutMs(),
> > >                 TimeUnit.MILLISECONDS);
> > >
> > >           }
> > >       })
> > >       .option(ChannelOption.SO_BACKLOG, 1)
> > >       .option(ChannelOption.SO_REUSEADDR, true)
> > >       .childOption(ChannelOption.SO_KEEPALIVE, true)
> > >       .bind(0)
> > >       .sync()
> > >       .channel();
> > >     this.port = ((InetSocketAddress) channel.localAddress()).getPort();
> > >
> > >
> > > I'd appreciate any feedback, and please let me know if a JIRA is
> > > required to keep track of this conversation. Thanks.
> > >
> > >
> >
>
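
For readers following the thread, the "+1 if the port is taken" idea discussed above could look roughly like the sketch below. This is only an illustration under stated assumptions, not the Hive implementation or the change made in HIVE-12222: it uses a plain java.net.ServerSocket as a stand-in for the Netty ServerBootstrap.bind(...) call in RpcServer, and the names PortRangeBinder, bindWithRetry, basePort and maxRetries are hypothetical.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical helper, not part of Hive: illustrates a configurable base
// port with a "+1 if taken" fallback, using ServerSocket instead of Netty.
public class PortRangeBinder {

  /**
   * Binds to basePort, or to the next free port up to basePort + maxRetries.
   * A basePort of 0 keeps the behaviour described in the thread: the OS
   * picks an arbitrary free port.
   */
  public static ServerSocket bindWithRetry(int basePort, int maxRetries)
      throws IOException {
    if (basePort == 0) {
      return new ServerSocket(0);
    }
    IOException lastFailure = null;
    for (int offset = 0; offset <= maxRetries; offset++) {
      int candidate = basePort + offset;
      if (candidate > 65535) {
        break; // ran past the valid TCP port range
      }
      try {
        return new ServerSocket(candidate);
      } catch (IOException e) {
        // Typically "Address already in use"; try the next port.
        lastFailure = e;
      }
    }
    throw new IOException("No free port in range " + basePort + "-"
        + (basePort + maxRetries), lastFailure);
  }
}
```

With such a scheme, a firewall rule between the data/worker nodes and the edge node would only need to cover basePort through basePort + maxRetries instead of 1024-65535. As noted in the thread, under HS2 the number of concurrent sessions is not known in advance, so the retry window would have to be sized accordingly.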