Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
MIME-Version: 1.0
Date: Tue, 27 Jan 2015 01:19:52 +1100
Message-ID: 
 <CAMfPbcYPNNrgWozAm2BLGg4aqg++EiHup6XYjhMD1Jk5kvKeww@mail.gmail.com>
Subject: Which [open-source] SQL engine atop Hadoop?
From: Samuel Marks <samuelmarks@gmail.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=001a1137c746f702af050d8ed4b6

--001a1137c746f702af050d8ed4b6
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Since Hadoop <https://hive.apache.org> came out, there have been various
commercial and/or open-source attempts to expose some compatibility with
SQL <http://drill.apache.org>.

I am seeking one which is good for low-latency querying, and supports the
most common CRUD <https://spark.apache.org>, including [the basics!] along
these lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table SET
C1=3D2 WHERE, DELETE FROM, and DROP TABLE.

I will be utilising them from Python; however, there does seem to be a
Python JDBC wrapper <https://spark.apache.org/sql>. Additionally, it
needs to be scalable for big and small data (starting on a single-node
"cluster").

Here is what I've found thus far:

   - Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL
   thanks to the Stinger initiative)
   - Apache Drill <http://drill.apache.org> (ANSI SQL support)
   - Apache Spark <https://spark.apache.org> (Spark SQL
   <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
   <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
   or Parquet <http://parquet.io/>)
   - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
   <http://hbase.apache.org>, lacks full transaction
   <http://en.wikipedia.org/wiki/Database_transaction> support, relational
   operators <http://en.wikipedia.org/wiki/Relational_operators> and some
   built-in functions)
   - Presto <https://github.com/facebook/presto> from Facebook (can query
   Hive, Cassandra <http://cassandra.apache.org>, relational DBs, etc.
   Doesn't seem to be designed for low-latency responses across small
   clusters, or to support UPDATE operations. It is optimized for data
   warehousing or analytics=C2=B9
   <http://prestodb.io/docs/current/overview/use-cases.html>)
   - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
   community edition <https://www.mapr.com/products/hadoop-download> (seems
   to be a packaging of Hive, HP Vertica
   <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
   Drill and a native ODBC wrapper
   <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
   - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
   interface and multi-dimensional analysis [OLAP
   <http://en.wikipedia.org/wiki/OLAP>], "=E2=80=A6 offers ANSI SQL on
   Hadoop and supports most ANSI SQL query functions". It depends on HDFS,
   MapReduce, Hive and HBase; and seems targeted at very large data-sets,
   though it maintains low query latency)
   - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard compliance
   with JDBC <http://en.wikipedia.org/wiki/JDBC> driver support [benchmarks
   against Hive and Impala
   <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
   - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
   Lingual <http://docs.cascading.org/lingual/1.0/>=C2=B2
   <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides
   JDBC Drivers, a SQL command shell, and a catalog manager for publishing
   files [or any resource] as schemas and tables.")

Which=E2=80=94from this list or elsewhere=E2=80=94would you recommend, and
why?

Thanks for all suggestions,

Samuel Marks
http://linkedin.com/in/samuelmarks

--001a1137c746f702af050d8ed4b6
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div class=3D"" itemprop=3D"text">

        <p>Since <a href=3D"https://hive.apache.org" rel=3D"nofollow">Hadoo=
p</a> came out, there have been various commercial and/or open-source attem=
pts to expose some compatibility with <a href=3D"http://drill.apache.org" r=
el=3D"nofollow">SQL</a>.</p>

<p>I am seeking one which is good for low-latency querying, and supports th=
e most common <a href=3D"https://spark.apache.org" rel=3D"nofollow">CRUD</a=
>, including [the basics!] along these lines: <code>CREATE TABLE</code>, <c=
ode>INSERT INTO</code>, <code>SELECT * FROM</code>, <code>UPDATE Table SET =
C1=3D2 WHERE</code>, <code>DELETE FROM</code>, and <code>DROP TABLE</code>.=
</p>

<p>I will be utilising them from Python, however there does seem to be a <a=
 href=3D"https://spark.apache.org/sql" rel=3D"nofollow">Python JDBC wrapper=
</a>. Additionally it needs to be scalable for big and small data (starting=
 on a single-node &quot;cluster&quot;).</p>

<p>Here is what I&#39;ve found thus far:</p>

<ul><li><a href=3D"https://hive.apache.org" rel=3D"nofollow">Apache Hive</a=
> (SQL-like, with interactive SQL thanks to the Stinger initiative)</li><li=
><a href=3D"http://drill.apache.org" rel=3D"nofollow">Apache Drill</a> (ANS=
I SQL support)</li><li><a href=3D"https://spark.apache.org" rel=3D"nofollow=
">Apache Spark</a> (<a href=3D"https://spark.apache.org/sql" rel=3D"nofollo=
w">Spark SQL</a>, queries only, add data via Hive, <a href=3D"https://spark=
.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD=
" rel=3D"nofollow">RDD</a> or <a href=3D"http://parquet.io/" rel=3D"nofollo=
w">Parquet</a>)</li><li><a href=3D"http://phoenix.apache.org" rel=3D"nofol=
low">Apache Phoenix</a> (built atop <a href=3D"http://hbase.apache.org" rel=
=3D"nofollow">Apache HBase</a>, lacks full <a href=3D"http://en.wikipedia.o=
rg/wiki/Database_transaction" rel=3D"nofollow">transaction</a> support, <a =
href=3D"http://en.wikipedia.org/wiki/Relational_operators" rel=3D"nofollow"=
>relational operators</a> and some built-in functions)</li><li><a href=3D"h=
ttps://github.com/facebook/presto" rel=3D"nofollow">Presto</a> from Faceboo=
k (can query Hive, <a href=3D"http://cassandra.apache.org" rel=3D"nofollow"=
>Cassandra</a>, relational DBs, etc. Doesn&#39;t seem to be designed for l=
ow-latency responses across small clusters, or to support <code>UPDATE</cod=
e> operations. It is optimized for data warehousing or analytics<a href=3D"=
http://prestodb.io/docs/current/overview/use-cases.html" rel=3D"nofollow">=
=C2=B9</a>)</li><li><a href=3D"https://www.mapr.com/why-hadoop/sql-hadoop" =
rel=3D"nofollow">SQL-Hadoop</a> via <a href=3D"https://www.mapr.com/product=
s/hadoop-download" rel=3D"nofollow">MapR community edition</a> (seems to be=
 a packaging of Hive, <a href=3D"http://www.vertica.com/hp-vertica-products=
/sqlonhadoop" rel=3D"nofollow">HP Vertica</a>, SparkSQL, Drill and a <a hre=
f=3D"http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC" rel=3D"nofollow">na=
tive ODBC wrapper</a>)</li><li><a href=3D"http://www.kylin.io" rel=3D"nofol=
low">Apache Kylin</a> from eBay (provides an SQL interface and multi-dimens=
ional analysis [<a href=3D"http://en.wikipedia.org/wiki/OLAP" rel=3D"nofoll=
ow">OLAP</a>],
 &quot;=E2=80=A6 offers ANSI SQL on Hadoop and supports most ANSI SQL query=
=20
functions&quot;. It depends on HDFS, MapReduce, Hive and HBase; and seems=
=20
targeted at very large data-sets, though it maintains low query latency)=
</li><l=
i><a href=3D"http://tajo.apache.org" rel=3D"nofollow">Apache Tajo</a> (ANSI=
/ISO SQL standard compliance with <a href=3D"http://en.wikipedia.org/wiki/J=
DBC" rel=3D"nofollow">JDBC</a> driver support [<a href=3D"http://blogs.gart=
ner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space" rel=3D"n=
ofollow">benchmarks against Hive and Impala</a>])</li><li><a href=3D"http:/=
/en.wikipedia.org/wiki/Cascading_%28software%29" rel=3D"nofollow">Cascading=
</a>&#39;s <a href=3D"http://docs.cascading.org/lingual/1.0/" rel=3D"nofoll=
ow">Lingual</a><a href=3D"http://docs.cascading.org/lingual/1.0/#sql-suppor=
t" rel=3D"nofollow">=C2=B2</a>
 (&quot;Lingual provides JDBC Drivers, a SQL command shell, and a catalog=
=20
manager for publishing files [or any resource] as schemas and tables.&quot;=
)</li></ul>

<p></p><p>Which=E2=80=94from this list or elsewhere=E2=80=94would you recom=
mend, and why?</p>

    </div><div><div class=3D"gmail_signature">Thanks for all suggestions,<b=
r><br>Samuel Marks<br><a href=3D"http://linkedin.com/in/samuelmarks" target=
=3D"_blank">http://linkedin.com/in/samuelmarks</a></div></div>
</div>

--001a1137c746f702af050d8ed4b6--