hive-user mailing list archives

From Samuel Marks <>
Subject Which [open-source] SQL engine atop Hadoop?
Date Fri, 30 Jan 2015 11:26:31 GMT
Since Hadoop came out, there have been various
commercial and/or open-source attempts to expose some compatibility with SQL.
Obviously, by posting here I am not expecting an
unbiased answer.

I am seeking an SQL-on-Hadoop offering which provides low-latency querying and
supports the most common CRUD operations, including [the
basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE
Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional support
would also be nice, but is not a must-have.
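
As a minimal sketch, the checklist above can be written as a smoke test
against any Python DB-API connection. The standard-library sqlite3 module
stands in for the Hadoop engine here purely so the snippet runs on its own;
the table name t, its columns, and the function name crud_smoke_test are
hypothetical. An engine lacking UPDATE or DELETE would be expected to fail
at the corresponding statement.

```python
import sqlite3


def crud_smoke_test(conn):
    """Run the six basic statements in order and return the surviving rows.

    `conn` is any DB-API 2.0 connection; table and column names are
    hypothetical and exist only for this illustration.
    """
    cur = conn.cursor()
    cur.execute("CREATE TABLE t (id INTEGER, c1 INTEGER)")
    cur.execute("INSERT INTO t VALUES (1, 1)")
    cur.execute("INSERT INTO t VALUES (2, 1)")
    cur.execute("UPDATE t SET c1 = 2 WHERE id = 1")   # the UPDATE ... SET C1=2 WHERE case
    cur.execute("DELETE FROM t WHERE id = 2")
    cur.execute("SELECT * FROM t")
    rows = cur.fetchall()
    cur.execute("DROP TABLE t")
    conn.commit()
    return rows


# sqlite3 is an assumption used as a stand-in, not one of the engines listed.
conn = sqlite3.connect(":memory:")
rows = crud_smoke_test(conn)
print(rows)
```

Swapping the connection for one obtained through a JDBC wrapper should leave
the function body unchanged, provided the wrapper follows the DB-API and the
engine accepts the same column types.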

Essentially I want a full replacement for the more traditional RDBMS, one
which can scale from 1 node to a serious Hadoop cluster.

Python is my language of choice for interfacing, however there does seem to
be a Python JDBC wrapper.

Here is what I've found thus far:

   - Apache Hive (SQL-like, with interactive SQL
   thanks to the Stinger initiative)
   - Apache Drill (ANSI SQL support)
   - Apache Spark (Spark SQL, queries only, add data via Hive, RDD
   or Parquet)
   - Apache Phoenix (built atop Apache HBase,
   lacks full transaction support, relational
   operators and some built-in functions)
   - Cloudera Impala
   (significant HiveQL support, some SQL language support, no support for
   indexes on its tables, and notably missing DELETE, UPDATE and INTERSECT,
   amongst others)
   - Presto from Facebook (can query
   Hive, Cassandra, relational DBs, etc.
   Doesn't seem to be designed for low-latency responses across small
   clusters, or to support UPDATE operations. It is optimized for data
   warehousing or analytics¹)
   - SQL-Hadoop via MapR
   community edition (seems
   to be a packaging of Hive, HP Vertica, SparkSQL,
   Drill and a native ODBC wrapper)
   - Apache Kylin from eBay (provides an SQL
   interface and multi-dimensional analysis [OLAP], "… offers ANSI SQL on Hadoop and
   supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
   Hive and HBase, and seems targeted at very large data-sets, though it maintains
   low query latency)
   - Apache Tajo (ANSI/ISO SQL standard compliance
   with JDBC driver support [benchmarks
   against Hive and Impala])
   - Cascading's Lingual² ("Lingual provides
   JDBC Drivers, a SQL command shell, and a catalog manager for publishing
   files [or any resource] as schemas and tables.")

Which, from this list or elsewhere, would you recommend, and why?
Thanks for all suggestions,

Samuel Marks
