hive-user mailing list archives

From Samuel Marks <samuelma...@gmail.com>
Subject Re: Which [open-souce] SQL engine atop Hadoop?
Date Sun, 01 Feb 2015 01:56:06 GMT
Interesting discussion. It looks like the HBase metastore can also be
configured to use HDFS HA (ex. tutorial
<http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_hag_hdfs_ha_cdh_components_config.html>
).

To get back on topic though, the primary contenders now are: Phoenix,
Lingual and perhaps Tajo or Drill?

Best,

Samuel Marks
http://linkedin.com/in/samuelmarks

On Sun, Feb 1, 2015 at 9:38 AM, Edward Capriolo <edlinuxguru@gmail.com>
wrote:

> "is the metastore thrift definition stable across hive versions?" I would
> say yes. Like many APIs the core eventually solidifies. No one is saying
> it will never ever change, but basically there are things like "database"
> and "table" and they have properties like "name". I have some basic scripts
> that look for table names matching patterns or summarize disk usage by
> owner. I have not had to touch them very much. Usually if they do change it
> is something small and if you tie the commit to a jira you can figure out
> what and why.
>
> On Sat, Jan 31, 2015 at 3:02 PM, Koert Kuipers <koert@tresata.com> wrote:
>
>> seems the metastore thrift service support SASL. thats great. so if i
>> understand it correctly all i need is the metastore thrift definition to
>> query the metastore.
>> is the metastore thrift definition stable across hive versions? if so,
>> then i can build my app once without worrying about the hive version
>> deployed. in that case i admit its not as bad as i thought. lets see!
>>
>> On Sat, Jan 31, 2015 at 2:41 PM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> oh sorry edward, i misread your post. seems we agree that "SQL constructs
>>> inside hive" are not for other systems.
>>>
>>> On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers <koert@tresata.com>
>>> wrote:
>>>
>>>> edward,
>>>> i would not call "SQL constructs inside hive" accessible for other
>>>> systems. its inside hive after all
>>>>
>>>> it is true that i can contact the metastore in java using
>>>> HiveMetaStoreClient, but then i need to bring in a whole slew of
>>>> dependencies (the minimum seems to be hive-metastore, hive-common,
>>>> hive-shims, libfb303, libthrift and a few hadoop dependencies, by trial and
>>>> error). these jars need to be "provided" and added to the classpath on the
>>>> cluster, unless someone is willing to build versions of an application for
>>>> every hive version out there. and even when you do all this you can only
>>>> pray its going to be compatible with the next hive version, since backwards
>>>> compatibility is... well lets just say lacking. the attitude seems to be
>>>> that hive does not have a java api, so there is nothing that needs to be
>>>> stable.
>>>>
>>>> you are right i could go the pure thrift road. i havent tried that yet.
>>>> that might just be the best option. but how easy is it to do this with a
>>>> secure hadoop/hive ecosystem? now i need to handle kerberos myself and
>>>> somehow pass tokens into thrift i assume?
>>>>
>>>> contrast all of this with an avro file on hadoop with metadata baked
>>>> in, and i think its safe to say hive metadata is not easily accessible.
>>>>
>>>> i will take a look at your book. i hope it has an example of using
>>>> thrift on a secure cluster to contact hive metastore (without using the
>>>> HiveMetaStoreClient), that would be awesome.
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo <edlinuxguru@gmail.com>
>>>> wrote:
>>>>
>>>>> "with the metadata in a special metadata store (not on hdfs), and its
>>>>> not as easy for all systems to access hive metadata." I disagree.
>>>>>
>>>>> Hive's metadata is not only accessible through the SQL constructs like
>>>>> "describe table". But the entire meta-store also is actually a thrift
>>>>> service so you have programmatic access to determine things like what
>>>>> columns are in a table etc. Thrift creates RPC clients for almost every
>>>>> major language.
>>>>>
>>>>> In the programming hive book
>>>>> http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e
>>>>> there are even examples where I show how to iterate all the tables inside
>>>>> the database from a java client.
>>>>>
>>>>> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <koert@tresata.com>
>>>>> wrote:
>>>>>
>>>>>> yes you can run whatever you like with the data in hdfs. keep in mind
>>>>>> that hive makes this general access pattern just a little harder, since
>>>>>> hive has a tendency to store data and metadata separately, with the
>>>>>> metadata in a special metadata store (not on hdfs), and its not as easy for
>>>>>> all systems to access hive metadata.
>>>>>>
>>>>>> i am not familiar at all with tajo or drill.
>>>>>>
>>>>>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <samuelmarks@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the advice
>>>>>>>
>>>>>>> Koert: when everything is in the same essential data-store (HDFS),
>>>>>>> can't I just run whatever complex tools in whichever paradigm they like?
>>>>>>>
>>>>>>> E.g.: GraphX, Mahout &etc.
>>>>>>>
>>>>>>> Also, what about Tajo or Drill?
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Samuel Marks
>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>
>>>>>>> PS: Spark-SQL is read-only IIRC, right?
>>>>>>> On 31 Jan 2015 03:39, "Koert Kuipers" <koert@tresata.com> wrote:
>>>>>>>
>>>>>>>> since you require high-powered analytics, and i assume you want to
>>>>>>>> stay sane while doing so, you require the ability to "drop out of sql" when
>>>>>>>> needed. so spark-sql and lingual would be my choices.
>>>>>>>>
>>>>>>>> low latency indicates phoenix or spark-sql to me.
>>>>>>>>
>>>>>>>> so i would say spark-sql
>>>>>>>>
>>>>>>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <
>>>>>>>> samuelmarks@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>>>>>>> exposing both JDBC and ODBC interfaces. However, although Pivotal does open-source
>>>>>>>>> a lot of software <http://www.pivotal.io/oss>, I don't believe
>>>>>>>>> they open source Pivotal HD: HAWQ.
>>>>>>>>>
>>>>>>>>> So that doesn't meet my requirements. I should note that the
>>>>>>>>> project I am building will also be open-source, which heightens the
>>>>>>>>> importance of having all components also being open-source.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Samuel Marks
>>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>>
>>>>>>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>>>>>>> siddharth.tiwari@live.com> wrote:
>>>>>>>>>
>>>>>>>>>> Have you looked at HAWQ from Pivotal ?
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <samuelmarks@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Since Hadoop <https://hive.apache.org> came out, there have been
>>>>>>>>>> various commercial and/or open-source attempts to expose some compatibility
>>>>>>>>>> with SQL <http://drill.apache.org>. Obviously by posting here I
>>>>>>>>>> am not expecting an unbiased answer.
>>>>>>>>>>
>>>>>>>>>> Seeking an SQL-on-Hadoop offering which provides: low-latency
>>>>>>>>>> querying, and supports the most common CRUD
>>>>>>>>>> <https://spark.apache.org>, including [the basics!] along these
>>>>>>>>>> lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table
>>>>>>>>>> SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional
>>>>>>>>>> support would be nice also, but is not a must-have.
>>>>>>>>>>
>>>>>>>>>> Essentially I want a full replacement for the more traditional
>>>>>>>>>> RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
>>>>>>>>>>
>>>>>>>>>> Python is my language of choice for interfacing, however there
>>>>>>>>>> does seem to be a Python JDBC wrapper
>>>>>>>>>> <https://spark.apache.org/sql>.
>>>>>>>>>>
>>>>>>>>>> Here is what I've found thus far:
>>>>>>>>>>
>>>>>>>>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with
>>>>>>>>>>    interactive SQL thanks to the Stinger initiative)
>>>>>>>>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>>>>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>>>>>>>    <https://spark.apache.org/sql>, queries only, add data via
>>>>>>>>>>    Hive, RDD
>>>>>>>>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>>>>>>>    or Parquet <http://parquet.io/>)
>>>>>>>>>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>>>>>>>    HBase <http://hbase.apache.org>, lacks full transaction
>>>>>>>>>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>>>>>>>>    operators <http://en.wikipedia.org/wiki/Relational_operators>
>>>>>>>>>>    and some built-in functions)
>>>>>>>>>>    - Cloudera Impala
>>>>>>>>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>>>>>>>    (significant HiveQL support, some SQL language support, no support for
>>>>>>>>>>    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>>>>>>>>>>    amongst others)
>>>>>>>>>>    - Presto <https://github.com/facebook/presto> from Facebook
>>>>>>>>>>    (can query Hive, Cassandra <http://cassandra.apache.org>,
>>>>>>>>>>    relational DBs &etc. Doesn't seem to be designed for low-latency responses
>>>>>>>>>>    across small clusters, or support UPDATE operations. It is
>>>>>>>>>>    optimized for data warehousing or analytics¹
>>>>>>>>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>>>>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>>>>>>>>>>    community edition
>>>>>>>>>>    <https://www.mapr.com/products/hadoop-download> (seems to be
>>>>>>>>>>    a packaging of Hive, HP Vertica
>>>>>>>>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>,
>>>>>>>>>>    SparkSQL, Drill and a native ODBC wrapper
>>>>>>>>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>>>>>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an
>>>>>>>>>>    SQL interface and multi-dimensional analysis [OLAP
>>>>>>>>>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on
>>>>>>>>>>    Hadoop and supports most ANSI SQL query functions". It depends on HDFS,
>>>>>>>>>>    MapReduce, Hive and HBase; and seems targeted at very large data-sets
>>>>>>>>>>    though maintains low query latency)
>>>>>>>>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>>>>>>>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC>
>>>>>>>>>>    driver support [benchmarks against Hive and Impala
>>>>>>>>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>>>>>>>>>>    ])
>>>>>>>>>>    - Cascading
>>>>>>>>>>    <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>>>>>>>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>>>>>>>>    <http://docs.cascading.org/lingual/1.0/#sql-support>
>>>>>>>>>>    ("Lingual provides JDBC Drivers, a SQL command shell, and a catalog manager
>>>>>>>>>>    for publishing files [or any resource] as schemas and tables.")
>>>>>>>>>>
>>>>>>>>>> Which, from this list or elsewhere, would you recommend, and why?
>>>>>>>>>> Thanks for all suggestions,
>>>>>>>>>>
>>>>>>>>>> Samuel Marks
>>>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
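
The CRUD baseline named in the original post (CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE ... SET ... WHERE, DELETE FROM, DROP TABLE) could be turned into a quick smoke test for comparing the candidate engines. What follows is only a minimal sketch in Python, assuming the engine is reachable through some DB-API 2.0 connection (for example via a JDBC wrapper, which is not shown here). The table name `t`, its columns, and the `crud_support` helper are hypothetical names chosen for illustration, and the built-in sqlite3 module is used purely as a stand-in so the example runs without a cluster.

```python
import sqlite3

# The six baseline statements from the original post, made concrete.
# Table and column names (t, id, c1) are illustrative assumptions.
CRUD_STATEMENTS = [
    ("CREATE TABLE", "CREATE TABLE t (id INTEGER, c1 INTEGER)"),
    ("INSERT INTO", "INSERT INTO t (id, c1) VALUES (1, 1)"),
    ("SELECT * FROM", "SELECT * FROM t"),
    ("UPDATE ... SET ... WHERE", "UPDATE t SET c1 = 2 WHERE id = 1"),
    ("DELETE FROM", "DELETE FROM t WHERE id = 1"),
    ("DROP TABLE", "DROP TABLE t"),
]


def crud_support(conn):
    """Run each statement on a DB-API 2.0 connection.

    Returns a dict mapping the statement label to True if the engine
    accepted it, or False if executing it raised an error.
    """
    results = {}
    for label, sql in CRUD_STATEMENTS:
        cur = conn.cursor()
        try:
            cur.execute(sql)
            results[label] = True
        except Exception:
            # Drivers raise their own error types for unsupported statements,
            # so this sketch catches broadly and records a failure.
            results[label] = False
        finally:
            cur.close()
    return results


# sqlite3 is only a stand-in; a connection to the candidate engine goes here.
report = crud_support(sqlite3.connect(":memory:"))
for label, ok in report.items():
    print(f"{label:<26} {'ok' if ok else 'unsupported'}")
```

An engine that rejects UPDATE or DELETE (as the list above notes for Impala) would show up as "unsupported" on those rows. The check only tells you whether a statement is accepted, not how fast it runs, so latency would still need to be measured separately.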
