hive-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Which [open-source] SQL engine atop Hadoop?
Date Sat, 31 Jan 2015 22:32:29 GMT
1: "SQL constructs inside hive" <--use jdbc driver "describe table" read
result set
2: "use thrift"
3: web hcat
https://cwiki.apache.org/confluence/display/Hive/WebHCat+InstallWebHCat#WebHCatInstallWebHCat-WebHCatInstalledwithHive
4: Just go the mysql db that backs the metastore and query directly

That gives you four ways to get at Hive's metadata.
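
For #1, a minimal sketch, assuming an unsecured HiveServer2 on
localhost:10000 and a table named "my_table" (both placeholders):

    // Read "describe table" output through the Hive JDBC driver.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DescribeTable {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();
        // Each row of the result set is: col_name, data_type, comment.
        ResultSet rs = stmt.executeQuery("describe my_table");
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        conn.close();
      }
    }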

>> "since backwards compatibility is... well lets just say lacking"
Welcome to open source software. Or all software in general really.

All I am getting at is that there are four ways right there to get at the
metadata.

>>"but how easy is it to do this with a secure hadoop/hive ecosystem? now i
need to handle kerberos myself and somehow pass tokens into thrift i
assume?"
Frankly, I do not give a crud about the "secure bla bla," but I have seen
several tickets on Thrift/SASL, so I assume someone does.
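
For what it's worth, the secure path looks roughly like this (a sketch
only, assuming a kerberized cluster with a SASL-enabled metastore; the
host, principal, and keytab path below are all placeholders):

    // Kerberos login plus a SASL-enabled metastore client.
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.security.UserGroupInformation;

    public class SecureMetastore {
      public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        conf.set("hive.metastore.uris", "thrift://metastore-host:9083");
        conf.set("hive.metastore.sasl.enabled", "true");
        conf.set("hive.metastore.kerberos.principal",
            "hive/_HOST@EXAMPLE.COM");
        // UGI needs Kerberos auth enabled before the login call.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Obtain a ticket before the Thrift connection is opened.
        UserGroupInformation.loginUserFromKeytab(
            "app@EXAMPLE.COM", "/etc/security/keytabs/app.keytab");
        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        System.out.println(client.getAllDatabases());
        client.close();
      }
    }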

My only point was that Hive gives you four ways to get at the metadata,
which is better than, say, MySQL or Vertica, which really only give you
option #1 over JDBC.

Hive actually works with Avro formats, where it can read the schema from
the data (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe), so
that other than pointing your "table" at a folder, the metadata is
automatic. Which is basically what you are describing.
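
A sketch of what that looks like over JDBC (all names and paths are made
up; this is the classic AvroSerDe form from the wiki page above, where the
table's columns come from the .avsc schema file rather than the DDL):

    // Create an Avro-backed external table; no column list is declared.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AvroTable {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();
        stmt.execute("CREATE EXTERNAL TABLE events"
            + " ROW FORMAT SERDE"
            + "   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'"
            + " STORED AS INPUTFORMAT"
            + "   'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'"
            + " OUTPUTFORMAT"
            + "   'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'"
            + " LOCATION '/data/events'"
            + " TBLPROPERTIES ('avro.schema.url'='/data/schemas/events.avsc')");
        conn.close();
      }
    }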

So again, it depends on your definition of easily accessible. But the fact
that I have a Thrift API which I can use to walk through the tables in a
database makes Hive more accessible than many other databases I am aware
of.
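
And by "walk through," I mean something like this minimal sketch against
an unsecured metastore (the host is a placeholder):

    // Enumerate databases, tables, and columns over the metastore Thrift API.
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.FieldSchema;

    public class WalkMetastore {
      public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        conf.set("hive.metastore.uris", "thrift://metastore-host:9083");
        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        for (String db : client.getAllDatabases()) {
          for (String table : client.getAllTables(db)) {
            System.out.println(db + "." + table);
            for (FieldSchema col : client.getFields(db, table)) {
              System.out.println("  " + col.getName() + " : " + col.getType());
            }
          }
        }
        client.close();
      }
    }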

On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers <koert@tresata.com> wrote:

> edward,
> i would not call "SQL constructs inside hive" accessible for other
> systems. its inside hive after all
>
> it is true that i can contact the metastore in java using
> HiveMetaStoreClient, but then i need to bring in a whole slew of
> dependencies (the miniumum seems to be hive-metastore, hive-common,
> hive-shims, libfb303, libthrift and a few hadoop dependencies, by trial and
> error). these jars need to be "provided" and added to the classpath on the
> cluster, unless someone is willing to build versions of an application for
> every hive version out there. and even when you do all this you can only
> pray its going to be compatible with the next hive version, since backwards
> compatibility is... well lets just say lacking. the attitude seems to be
> that hive does not have a java api, so there is nothing that needs to be
> stable.
>
> you are right i could go the pure thrift road. i havent tried that yet.
> that might just be the best option. but how easy is it to do this with a
> secure hadoop/hive ecosystem? now i need to handle kerberos myself and
> somehow pass tokens into thrift i assume?
>
> contrast all of this with an avro file on hadoop with metadata baked in,
> and i think its safe to say hive metadata is not easily accessible.
>
> i will take a look at your book. i hope it has an example of using thrift
> on a secure cluster to contact hive metastore (without using the
> HiveMetaStoreClient), that would be awesome.
>
>
>
>
> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo <edlinuxguru@gmail.com>
> wrote:
>
>> "with the metadata in a special metadata store (not on hdfs), and its not
>> as easy for all systems to access hive metadata." I disagree.
>>
>> Hives metadata is not only accessible through the SQL constructs like
>> "describe table". But the entire meta-store also is actually a thrift
>> service so you have programmatic access to determine things like what
>> columns are in a table etc. Thrift creates RPC clients for almost every
>> major language.
>>
>> In the programming hive book
>> http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e
>> there is even examples where I show how to iterate all the tables inside
>> the database from a java client.
>>
>> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <koert@tresata.com>
>> wrote:
>>
>>> yes you can run whatever you like with the data in hdfs. keep in mind
>>> that hive makes this general access pattern just a little harder, since
>>> hive has a tendency to store data and metadata separately, with the
>>> metadata in a special metadata store (not on hdfs), and its not as easy for
>>> all systems to access hive metadata.
>>>
>>> i am not familiar at all with tajo or drill.
>>>
>>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <samuelmarks@gmail.com>
>>> wrote:
>>>
>>>> Thanks for the advice
>>>>
>>>> Koert: when everything is in the same essential data-store (HDFS),
>>>> can't I just run whatever complex tools I'm whichever paradigm they like?
>>>>
>>>> E.g.: GraphX, Mahout &etc.
>>>>
>>>> Also, what about Tajo or Drill?
>>>>
>>>> Best,
>>>>
>>>> Samuel Marks
>>>> http://linkedin.com/in/samuelmarks
>>>>
>>>> PS: Spark-SQL is read-only IIRC, right?
>>>> On 31 Jan 2015 03:39, "Koert Kuipers" <koert@tresata.com> wrote:
>>>>
>>>>> since you require high-powered analytics, and i assume you want to
>>>>> stay sane while doing so, you require the ability to "drop out of sql"
when
>>>>> needed. so spark-sql and lingual would be my choices.
>>>>>
>>>>> low latency indicates phoenix or spark-sql to me.
>>>>>
>>>>> so i would say spark-sql
>>>>>
>>>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <samuelmarks@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>>>> exposing both JDBC and ODBC interfaces. However, although Pivotal
does open-source
>>>>>> a lot of software <http://www.pivotal.io/oss>, I don't believe
they
>>>>>> open source Pivotal HD: HAWQ.
>>>>>>
>>>>>> So that doesn't meet my requirements. I should note that the project
>>>>>> I am building will also be open-source, which heightens the importance
of
>>>>>> having all components also being open-source.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Samuel Marks
>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>
>>>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>>>> siddharth.tiwari@live.com> wrote:
>>>>>>
>>>>>>> Have you looked at HAWQ from Pivotal ?
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <samuelmarks@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Since Hadoop <https://hive.apache.org> came out, there
have been
>>>>>>> various commercial and/or open-source attempts to expose some
compatibility
>>>>>>> with SQL <http://drill.apache.org>. Obviously by posting
here I am
>>>>>>> not expecting an unbiased answer.
>>>>>>>
>>>>>>> Seeking an SQL-on-Hadoop offering which provides: low-latency
>>>>>>> querying, and supports the most common CRUD
>>>>>>> <https://spark.apache.org>, including [the basics!] along
these
>>>>>>> lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table
SET
>>>>>>> C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional support
>>>>>>> would be nice also, but is not a must-have.
>>>>>>>
>>>>>>> Essentially I want a full replacement for the more traditional
>>>>>>> RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
>>>>>>>
>>>>>>> Python is my language of choice for interfacing, however there
does
>>>>>>> seem to be a Python JDBC wrapper <https://spark.apache.org/sql>.
>>>>>>>
>>>>>>> Here is what I've found thus far:
>>>>>>>
>>>>>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with
>>>>>>>    interactive SQL thanks to the Stinger initiative)
>>>>>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>>>>    <https://spark.apache.org/sql>, queries only, add data
via Hive,
>>>>>>>    RDD
>>>>>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>>>>    or Paraquet <http://parquet.io/>)
>>>>>>>    - Apache Phoenix <http://phoenix.apache.org> (built
atop Apache
>>>>>>>    HBase <http://hbase.apache.org>, lacks full transaction
>>>>>>>    <http://en.wikipedia.org/wiki/Database_transaction>
support, relational
>>>>>>>    operators <http://en.wikipedia.org/wiki/Relational_operators>
>>>>>>>    and some built-in functions)
>>>>>>>    - Cloudera Impala
>>>>>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>>>>    (significant HiveQL support, some SQL language support, no
support for
>>>>>>>    indexes on its tables, importantly missing DELETE, UPDATE
and INTERSECT;
>>>>>>>    amongst others)
>>>>>>>    - Presto <https://github.com/facebook/presto> from Facebook
(can
>>>>>>>    query Hive, Cassandra <http://cassandra.apache.org>,
relational
>>>>>>>    DBs &etc. Doesn't seem to be designed for low-latency
responses across
>>>>>>>    small clusters, or support UPDATE operations. It is optimized
>>>>>>>    for data warehousing or analytics¹
>>>>>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop>
via MapR
>>>>>>>    community edition <https://www.mapr.com/products/hadoop-download>
>>>>>>>    (seems to be a packaging of Hive, HP Vertica
>>>>>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>,
>>>>>>>    SparkSQL, Drill and a native ODBC wrapper
>>>>>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>>>    - Apache Kylin <http://www.kylin.io> from Ebay (provides
an SQL
>>>>>>>    interface and multi-dimensional analysis [OLAP
>>>>>>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI
SQL on
>>>>>>>    Hadoop and supports most ANSI SQL query functions". It depends
on HDFS,
>>>>>>>    MapReduce, Hive and HBase; and seems targeted at very large
data-sets
>>>>>>>    though maintains low query latency)
>>>>>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL
standard
>>>>>>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC>
driver
>>>>>>>    support [benchmarks against Hive and Impala
>>>>>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>>>>>>>    ])
>>>>>>>    - Cascading
>>>>>>>    <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
Lingual
>>>>>>>    <http://docs.cascading.org/lingual/1.0/>²
>>>>>>>    <http://docs.cascading.org/lingual/1.0/#sql-support>
("Lingual
>>>>>>>    provides JDBC Drivers, a SQL command shell, and a catalog
manager for
>>>>>>>    publishing files [or any resource] as schemas and tables.")
>>>>>>>
>>>>>>> Which—from this list or elsewhere—would you recommend, and
why?
>>>>>>> Thanks for all suggestions,
>>>>>>>
>>>>>>> Samuel Marks
>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>
