hive-user mailing list archives

From Edward Capriolo <>
Subject Re: Which [open-source] SQL engine atop Hadoop?
Date Sat, 31 Jan 2015 22:32:29 GMT
1: "SQL constructs inside hive" <-- use the JDBC driver, issue "describe
table", and read the result set
2: Use Thrift to talk to the metastore service directly
3: WebHCat
4: Just go to the MySQL database that backs the metastore and query it
directly

That gives you 4 ways to get at Hive's metadata.
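Option #4 can be sketched as follows. The DBS and TBLS tables (and the columns used here) do exist in the Hive metastore schema, but the real schema is much larger; the in-memory SQLite database below is a simplified stand-in for the backing MySQL instance, for illustration only.

```python
import sqlite3

# Stand-in for the MySQL database backing the metastore (option #4).
# DBS and TBLS with these columns are a simplified subset of the real
# metastore schema, populated with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DBS  (DB_ID INTEGER PRIMARY KEY, NAME TEXT);
    CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER,
                       TBL_NAME TEXT, TBL_TYPE TEXT);
    INSERT INTO DBS  VALUES (1, 'default');
    INSERT INTO TBLS VALUES (1, 1, 'page_views', 'MANAGED_TABLE');
    INSERT INTO TBLS VALUES (2, 1, 'users',      'EXTERNAL_TABLE');
""")

# List every table per database -- the same shape of query you would
# run against the real MySQL metastore.
rows = conn.execute("""
    SELECT d.NAME, t.TBL_NAME, t.TBL_TYPE
    FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID
    ORDER BY d.NAME, t.TBL_NAME
""").fetchall()

for db, tbl, kind in rows:
    print(f"{db}.{tbl} ({kind})")
```

The caveat with this route is that the schema is an internal detail of the metastore and can change between Hive versions, which is exactly the compatibility complaint raised later in this thread.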

>> "since backwards compatibility is... well lets just say lacking"
Welcome to open source software. Or all software in general really.

All I am getting at is that there are 4 ways right there to get at the
metadata.

>> "but how easy is it to do this with a secure hadoop/hive ecosystem? now i
need to handle kerberos myself and somehow pass tokens into thrift i assume?"
Frankly I do not give a crud about the "secure bla bla", but I have seen
several tickets on thrift/sasl so I assume someone does.

My only point was that Hive gives you 4 ways to get at the metadata, which
is better than, say, MySQL or Vertica, which really only give you option #1
over JDBC.

Hive actually works with Avro formats, where it can read the schema from
the data itself, so that other than pointing your "table" at a folder the
metadata is automatic. Which is basically what you are describing.
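To make the "schema travels with the data" point concrete, here is a minimal sketch of the Avro object-container header, hand-rolled with nothing but the standard library: the writer schema is stored as JSON in the file's metadata map, so any reader can recover it from the file alone. Real code would use the avro or fastavro libraries; this only shows where the schema lives.

```python
import io
import json

def write_long(buf, n):
    """Avro long: zigzag-encoded, then little-endian 7-bit varint groups."""
    z = (n << 1) ^ (n >> 63)              # zigzag
    while z > 0x7F:
        buf.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    buf.write(bytes([z]))

def read_long(buf):
    z, shift = 0, 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)            # un-zigzag

def write_bytes(buf, data):
    write_long(buf, len(data))
    buf.write(data)

def read_bytes(buf):
    return buf.read(read_long(buf))

def write_header(buf, schema):
    buf.write(b"Obj\x01")                 # Avro container magic
    meta = {"avro.schema": json.dumps(schema).encode(),
            "avro.codec": b"null"}
    write_long(buf, len(meta))            # one map block holding all entries
    for key, val in meta.items():
        write_bytes(buf, key.encode())
        write_bytes(buf, val)
    write_long(buf, 0)                    # empty block terminates the map
    buf.write(b"\x00" * 16)               # sync marker (random in real files)

def read_schema(buf):
    assert buf.read(4) == b"Obj\x01", "not an Avro container file"
    meta = {}
    while True:
        count = read_long(buf)
        if count == 0:
            break
        if count < 0:                     # negative count: byte size follows
            count, _ = -count, read_long(buf)
        for _ in range(count):
            key = read_bytes(buf).decode()
            meta[key] = read_bytes(buf)
    return json.loads(meta["avro.schema"])

schema = {"type": "record", "name": "User",
          "fields": [{"name": "id", "type": "long"},
                     {"name": "email", "type": "string"}]}
f = io.BytesIO()
write_header(f, schema)
f.seek(0)
print(read_schema(f))                     # the schema came out of the file itself
```

This is why an Avro file on HDFS is self-describing in a way plain delimited text is not: the reader needs no external metadata service at all.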

So again it depends on your definition of easily accessible. But the fact
that I have a Thrift API which I can use to walk through the tables in a
database seems more accessible than many other databases I am aware of.
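That walk can be sketched like this. get_all_databases() and get_all_tables(db) are actual methods of the ThriftHiveMetastore service; a real client would be a generated Thrift stub (or a wrapper library) connected to the metastore, usually on port 9083. Since the sketch has no cluster to talk to, a hypothetical stub client stands in for the Thrift connection.

```python
def walk_metastore(client):
    """Yield (database, table) pairs for every table the metastore knows.

    `client` is anything exposing the ThriftHiveMetastore methods
    get_all_databases() and get_all_tables(db).
    """
    for db in client.get_all_databases():
        for table in client.get_all_tables(db):
            yield db, table

class StubMetastoreClient:
    """Hypothetical stand-in for a Thrift client; returns canned metadata."""
    def __init__(self, catalog):
        self.catalog = catalog            # {db_name: [table_name, ...]}
    def get_all_databases(self):
        return sorted(self.catalog)
    def get_all_tables(self, db):
        return sorted(self.catalog[db])

client = StubMetastoreClient({"default": ["page_views", "users"],
                              "logs": ["raw_events"]})
pairs = list(walk_metastore(client))
for db, table in pairs:
    print(f"{db}.{table}")
```

Because Thrift generates RPC clients for most major languages, the same two calls are available from Java, Python, C++, and so on.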

On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers <> wrote:

> edward,
> i would not call "SQL constructs inside hive" accessible for other
> systems. its inside hive after all
> it is true that i can contact the metastore in java using
> HiveMetaStoreClient, but then i need to bring in a whole slew of
> dependencies (the minimum seems to be hive-metastore, hive-common,
> hive-shims, libfb303, libthrift and a few hadoop dependencies, by trial and
> error). these jars need to be "provided" and added to the classpath on the
> cluster, unless someone is willing to build versions of an application for
> every hive version out there. and even when you do all this you can only
> pray its going to be compatible with the next hive version, since backwards
> compatibility is... well lets just say lacking. the attitude seems to be
> that hive does not have a java api, so there is nothing that needs to be
> stable.
> you are right i could go the pure thrift road. i havent tried that yet.
> that might just be the best option. but how easy is it to do this with a
> secure hadoop/hive ecosystem? now i need to handle kerberos myself and
> somehow pass tokens into thrift i assume?
> contrast all of this with an avro file on hadoop with metadata baked in,
> and i think its safe to say hive metadata is not easily accessible.
> i will take a look at your book. i hope it has an example of using thrift
> on a secure cluster to contact hive metastore (without using the
> HiveMetaStoreClient), that would be awesome.
> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo <>
> wrote:
>> "with the metadata in a special metadata store (not on hdfs), and its not
>> as easy for all systems to access hive metadata." I disagree.
>> Hive's metadata is not only accessible through SQL constructs like
>> "describe table"; the entire metastore is actually a Thrift service, so
>> you have programmatic access to determine things like what columns are in
>> a table, etc. Thrift creates RPC clients for almost every major language.
>> In the Programming Hive book there are even examples where I show how to
>> iterate all the tables inside the database from a java client.
>> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <>
>> wrote:
>>> yes you can run whatever you like with the data in hdfs. keep in mind
>>> that hive makes this general access pattern just a little harder, since
>>> hive has a tendency to store data and metadata separately, with the
>>> metadata in a special metadata store (not on hdfs), and its not as easy for
>>> all systems to access hive metadata.
>>> i am not familiar at all with tajo or drill.
>>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <>
>>> wrote:
>>>> Thanks for the advice
>>>> Koert: when everything is in the same essential data-store (HDFS),
>>>> can't I just run whatever complex tools in whichever paradigm I like?
>>>> E.g.: GraphX, Mahout, etc.
>>>> Also, what about Tajo or Drill?
>>>> Best,
>>>> Samuel Marks
>>>> PS: Spark-SQL is read-only IIRC, right?
>>>> On 31 Jan 2015 03:39, "Koert Kuipers" <> wrote:
>>>>> since you require high-powered analytics, and i assume you want to
>>>>> stay sane while doing so, you require the ability to "drop out of sql"
>>>>> when needed. so spark-sql and lingual would be my choices.
>>>>> low latency indicates phoenix or spark-sql to me.
>>>>> so i would say spark-sql
>>>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <>
>>>>> wrote:
>>>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>>>> exposing both JDBC and ODBC interfaces. However, although Pivotal
>>>>>> does open-source a lot of software <>, I don't believe they open
>>>>>> source Pivotal HD: HAWQ.
>>>>>> So that doesn't meet my requirements. I should note that the project
>>>>>> I am building will also be open-source, which heightens the
>>>>>> importance of having all components also be open-source.
>>>>>> Cheers,
>>>>>> Samuel Marks
>>>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <> wrote:
>>>>>>> Have you looked at HAWQ from Pivotal ?
>>>>>>> Sent from my iPhone
>>>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <>
>>>>>>> wrote:
>>>>>>> Since Hadoop <> came out, there have been various commercial and/or
>>>>>>> open-source attempts to expose some of it with SQL <>. Obviously by
>>>>>>> posting here I am not expecting an unbiased answer.
>>>>>>> Seeking an SQL-on-Hadoop offering which provides: low-latency
>>>>>>> querying, and supports the most common CRUD <>, including [the
>>>>>>> basics!] along the lines of UPDATE ... SET C1=2 WHERE, DELETE FROM,
>>>>>>> and DROP TABLE. Transactional support would be nice also, but is not
>>>>>>> a must-have.
>>>>>>> Essentially I want a full replacement for the more traditional
>>>>>>> RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
>>>>>>> Python is my language of choice for interfacing, however there does
>>>>>>> seem to be a Python JDBC wrapper <>.
>>>>>>> Here is what I've found thus far:
>>>>>>>    - Apache Hive <> (SQL-like, with interactive SQL thanks to the
>>>>>>>    Stinger initiative)
>>>>>>>    - Apache Drill <> (ANSI SQL support)
>>>>>>>    - Apache Spark <> (Spark SQL <>, queries only, add data via
>>>>>>>    Hive, RDD <> or Parquet <>)
>>>>>>>    - Apache Phoenix <> (built atop Apache HBase <>, lacks full
>>>>>>>    transaction <> support, relational operators <> and some
>>>>>>>    built-in functions)
>>>>>>>    - Cloudera Impala <> (significant HiveQL support, some SQL
>>>>>>>    language support, no support for indexes on its tables,
>>>>>>>    importantly missing DELETE, UPDATE amongst others)
>>>>>>>    - Presto <> from Facebook (can query Hive, Cassandra <>,
>>>>>>>    relational DBs, etc. Doesn't seem to be designed for low-latency
>>>>>>>    responses across small clusters, or to support UPDATE operations.
>>>>>>>    It is optimized for data warehousing or analytics¹ <>)
>>>>>>>    - SQL-Hadoop <> via the MapR community edition <> (seems to be a
>>>>>>>    packaging of Hive, HP Vertica <>, SparkSQL, Drill and a native
>>>>>>>    ODBC wrapper <>)
>>>>>>>    - Apache Kylin <> from eBay (provides an SQL interface and
>>>>>>>    multi-dimensional analysis [OLAP <>], "… offers ANSI SQL on
>>>>>>>    Hadoop and supports most ANSI SQL query functions". It depends on
>>>>>>>    HDFS, MapReduce, Hive and HBase; and seems targeted at very large
>>>>>>>    datasets, though maintains low query latency)
>>>>>>>    - Apache Tajo <> (ANSI/ISO SQL compliance with JDBC <> support
>>>>>>>    [benchmarks against Hive and Impala <>])
>>>>>>>    - Cascading <>'s Lingual² <> ("Lingual provides JDBC Drivers, a
>>>>>>>    SQL command shell, and a catalog manager for publishing files
>>>>>>>    [or any resource] as schemas and tables.")
>>>>>>> Which—from this list or elsewhere—would you recommend, and why?
>>>>>>> Thanks for all suggestions,
>>>>>>> Samuel Marks
