hive-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: Which [open-source] SQL engine atop Hadoop?
Date Sat, 31 Jan 2015 16:05:57 GMT
yes you can run whatever you like with the data in hdfs. keep in mind that
hive makes this general access pattern just a little harder, since hive has
a tendency to store data and metadata separately, with the metadata in a
special metadata store (not on hdfs), and it's not as easy for all systems
to access hive metadata.

i am not familiar at all with tajo or drill.

On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <samuelmarks@gmail.com> wrote:

> Thanks for the advice
>
> Koert: when everything is in the same essential data-store (HDFS), can't I
> just run whatever complex tools in whichever paradigm they like?
>
> E.g.: GraphX, Mahout &etc.
>
> Also, what about Tajo or Drill?
>
> Best,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
> PS: Spark-SQL is read-only IIRC, right?
> On 31 Jan 2015 03:39, "Koert Kuipers" <koert@tresata.com> wrote:
>
>> since you require high-powered analytics, and i assume you want to stay
>> sane while doing so, you require the ability to "drop out of sql" when
>> needed. so spark-sql and lingual would be my choices.
>>
>> low latency indicates phoenix or spark-sql to me.
>>
>> so i would say spark-sql
>>
>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <samuelmarks@gmail.com>
>> wrote:
>>
>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>> exposing both JDBC and ODBC interfaces. However, although Pivotal does open-source
>>> a lot of software <http://www.pivotal.io/oss>, I don't believe they
>>> open source Pivotal HD: HAWQ.
>>>
>>> So that doesn't meet my requirements. I should note that the project I
>>> am building will also be open-source, which heightens the importance of
>>> having all components also be open-source.
>>>
>>> Cheers,
>>>
>>> Samuel Marks
>>> http://linkedin.com/in/samuelmarks
>>>
>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>> siddharth.tiwari@live.com> wrote:
>>>
>>>> Have you looked at HAWQ from Pivotal ?
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <samuelmarks@gmail.com>
>>>> wrote:
>>>>
>>>> Since Hadoop <https://hive.apache.org> came out, there have been
>>>> various commercial and/or open-source attempts to expose some compatibility
>>>> with SQL <http://drill.apache.org>. Obviously by posting here I am not
>>>> expecting an unbiased answer.
>>>>
>>>> Seeking an SQL-on-Hadoop offering which provides: low-latency querying,
>>>> and supports the most common CRUD <https://spark.apache.org>,
>>>> including [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT
>>>> * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
>>>> Transactional support would be nice also, but is not a must-have.
>>>>
>>>> Essentially I want a full replacement for the more traditional RDBMS,
>>>> one which can scale from 1 node to a serious Hadoop cluster.
>>>>
>>>> Python is my language of choice for interfacing, however there does
>>>> seem to be a Python JDBC wrapper <https://spark.apache.org/sql>.
>>>>
>>>> Here is what I've found thus far:
>>>>
>>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with interactive
>>>>    SQL thanks to the Stinger initiative)
>>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>    <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
>>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>    or Parquet <http://parquet.io/>)
>>>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>    HBase <http://hbase.apache.org>, lacks full transaction
>>>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>>    operators <http://en.wikipedia.org/wiki/Relational_operators> and
>>>>    some built-in functions)
>>>>    - Cloudera Impala
>>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>    (significant HiveQL support, some SQL language support, no support for
>>>>    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>>>>    amongst others)
>>>>    - Presto <https://github.com/facebook/presto> from Facebook (can
>>>>    query Hive, Cassandra <http://cassandra.apache.org>, relational DBs
>>>>    &etc. Doesn't seem to be designed for low-latency responses across small
>>>>    clusters, or support UPDATE operations. It is optimized for data
>>>>    warehousing or analytics¹
>>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>>>>    community edition <https://www.mapr.com/products/hadoop-download>
>>>>    (seems to be a packaging of Hive, HP Vertica
>>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>>>>    Drill and a native ODBC wrapper
>>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>>>>    interface and multi-dimensional analysis [OLAP
>>>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop
>>>>    and supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
>>>>    Hive and HBase; and seems targeted at very large data-sets though maintains
>>>>    low query latency)
>>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>>>>    support [benchmarks against Hive and Impala
>>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>>>>    ])
>>>>    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>>    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>>>>    provides JDBC Drivers, a SQL command shell, and a catalog manager for
>>>>    publishing files [or any resource] as schemas and tables.")
>>>>
>>>> Which—from this list or elsewhere—would you recommend, and why?
>>>> Thanks for all suggestions,
>>>>
>>>> Samuel Marks
>>>> http://linkedin.com/in/samuelmarks
>>>>
>>>>
>>>
>>
