hive-issues mailing list archives

From "Gopal V (JIRA)" <>
Subject [jira] [Commented] (HIVE-10511) Replacing the implementation of Hive CLI using Beeline
Date Mon, 13 Feb 2017 21:17:42 GMT


Gopal V commented on HIVE-10511:

[~gss2002]: The big issue with HS2 was primarily with MapredLocalTask (if you're going by
the Cloudera docs), which effectively burns down the CPU on the HS2 box and both uploads data
to and downloads data from HDFS to run joins.

Currently, I'm doing ~100 concurrent queries per 16 GB HS2 with LLAP (so, approx. 250-500
sessions per box on Tableau). Some parts of it still need to improve, particularly when
moving more than 10k rows per query.

bq. how do you plan on stopping folks on using sparkSQL cli as it goes directly at metastore
and fs 


Look, we're not in the business of stopping users from doing what they want - we're not going
to go down that way.

However, some admins and business owners are. When dealing with some groups of users, the
fact that a SQL user can't just "copy all this data to my laptop and sell it somewhere" is
an advantage.

Current solutions (where the filesystem is the only permission level) involve maintaining different
copies of data to keep it safe with raw file permissions. Imagine GPS pickup/dropoff, billing
address, CC # and real-name in a db - the pricing analysis guys need the first three, the
billing folks need the last three, etc. This is insanity when it comes to ETL scheduling and keeping
all parts of the system in sync - so people who go down the "maintain different copies" path
will be carrying a pager daily.
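
The column-split example above can be sketched roughly as follows. This is only an illustration of the idea of one shared table with per-role column visibility, not how Hive or any authorization plugin implements it. The column names, the role names, and the ALLOWED_COLUMNS mapping are all hypothetical, and "GPS pickup/dropoff" is assumed to mean two separate columns.

```python
# Hypothetical schema for the example: five columns in a single table.
# Assumption: "GPS pickup/dropoff" is read as two columns.
COLUMNS = ["gps_pickup", "gps_dropoff", "billing_address", "cc_number", "real_name"]

# Hypothetical per-role column permissions over the one shared table.
ALLOWED_COLUMNS = {
    "pricing_analysis": COLUMNS[:3],  # first three columns
    "billing": COLUMNS[-3:],          # last three columns
}


def project_for_role(rows, role):
    """Return only the columns the role may read, without copying the table per team."""
    allowed = ALLOWED_COLUMNS[role]
    return [{col: row[col] for col in allowed} for row in rows]


# One made-up row standing in for the system of record.
rides = [
    {
        "gps_pickup": "37.77,-122.42",
        "gps_dropoff": "37.80,-122.27",
        "billing_address": "1 Example St",
        "cc_number": "0000-0000-0000-0000",
        "real_name": "Jane Doe",
    },
]

pricing_view = project_for_role(rides, "pricing_analysis")
billing_view = project_for_role(rides, "billing")
```

With filesystem-only permissions, each of those two projections would have to exist as its own physical copy kept in sync by ETL; here both are derived from the same rows at read time.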

Not all of that data processing is SQL, at least not the geo-location or clustering, so in
my view, Hive (as a system of record) needs to make sure Spark is not left out of the workflows.

> Replacing the implementation of Hive CLI using Beeline
> ------------------------------------------------------
>                 Key: HIVE-10511
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: CLI
>    Affects Versions: 0.10.0
>            Reporter: Xuefu Zhang
>            Assignee: Ferdinand Xu
> Hive CLI is a legacy tool which had two main use cases: 
> 1. a thick client for SQL on hadoop
> 2. a command line tool for HiveServer1.
> HiveServer1 is already deprecated and removed from the Hive code base, so use case #2 is
> out of the question. For #1, Beeline provides, or is supposed to provide, equal functionality,
> yet is implemented differently from Hive CLI.
> As the Hive community has been recommending the Beeline + HS2 configuration for a while,
> ideally we should deprecate Hive CLI. Because of the wide use of Hive CLI, we instead propose
> replacing Hive CLI's implementation with Beeline plus embedded HS2 so that the Hive community
> only needs to maintain a single code path. In this way, Hive CLI is just an alias to Beeline
> at either the shell script level or at a high code level. The goal is that no changes, or only
> minimal changes, are expected from existing user scripts using Hive CLI.
> This is an umbrella JIRA covering all tasks related to this initiative. Over the last
> year or two, Beeline has been improved significantly to match what Hive CLI offers. Still,
> there may be some gaps or deficiencies to be discovered and fixed. In the meantime, we
> also want to make sure that enough tests are included and that the performance impact is
> identified and addressed.

This message was sent by Atlassian JIRA