Date: Tue, 27 Jan 2015 01:19:52 +1100
Subject: Which [open-source] SQL engine atop Hadoop?
From: Samuel Marks
To: user@hadoop.apache.org

Since Hadoop came out, there have been various commercial and/or open-source attempts to expose some compatibility with SQL.

I am seeking one which is good for low-latency querying and supports the most common CRUD operations, including [the basics!] statements along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.

I will be using it from Python; however, there does seem to be a Python JDBC wrapper. Additionally, it needs to scale across big and small data (starting on a single-node "cluster").

Here is what I've found thus far:

- Apache Hive (SQL-like, with interactive SQL thanks to the Stinger initiative)
- Apache Drill (ANSI SQL support)
- Apache Spark (Spark SQL; queries only; add data via Hive, RDD or Parquet)
- Apache Phoenix (built atop Apache HBase; lacks full transaction support, relational operators and some built-in functions)
- Presto from Facebook (can query Hive, Cassandra, relational DBs, etc. Doesn't seem to be designed for low-latency responses across small clusters, or to support UPDATE operations. It is optimized for data warehousing or analytics¹)
- SQL-Hadoop via MapR community edition (seems to be a packaging of Hive, HP Vertica, SparkSQL, Drill and a native ODBC wrapper)
- Apache Kylin from eBay (provides an SQL interface and multi-dimensional analysis [OLAP], "… offers ANSI SQL on Hadoop and supports most ANSI SQL query functions". It depends on HDFS, MapReduce, Hive and HBase, and seems targeted at very large data-sets, though it maintains low query latency)
- Apache Tajo (ANSI/ISO SQL standard compliance with JDBC driver support [benchmarks against Hive and Impala])
- Cascading's Lingual² ("Lingual provides JDBC Drivers, a SQL command shell, and a catalog manager for publishing files [or any resource] as schemas and tables.")

Which, from this list or elsewhere, would you recommend, and why?

Thanks for all suggestions,

Samuel Marks
http://linkedin.com/in/samuelmarks
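
As a rough illustration of the CRUD requirement in the message above, the sketch below runs the six statement shapes through Python's DB-API 2.0 connection/cursor interface. It uses the standard-library sqlite3 module purely as a stand-in, since which Hadoop engine to use is the open question. The table `t`, its columns `id` and `c1`, and the helper `run_crud` are hypothetical names, not from the original message. It also assumes that a Python JDBC wrapper would expose a comparable connection object, which would need checking for each engine.

```python
import sqlite3


def run_crud(conn):
    """Run the six statement shapes on a DB-API 2.0 connection.

    Returns the rows seen after the inserts, after the update, and
    after the delete. Literal values are used instead of bound
    parameters because the placeholder style differs between drivers.
    """
    cur = conn.cursor()

    # CREATE TABLE (hypothetical table and columns, for illustration only)
    cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, c1 INTEGER)")

    # INSERT INTO
    cur.execute("INSERT INTO t (id, c1) VALUES (1, 1)")
    cur.execute("INSERT INTO t (id, c1) VALUES (2, 5)")

    # SELECT * FROM
    cur.execute("SELECT * FROM t ORDER BY id")
    after_insert = cur.fetchall()

    # UPDATE Table SET C1=2 WHERE ...
    cur.execute("UPDATE t SET c1 = 2 WHERE id = 1")
    cur.execute("SELECT * FROM t ORDER BY id")
    after_update = cur.fetchall()

    # DELETE FROM
    cur.execute("DELETE FROM t WHERE id = 2")
    cur.execute("SELECT * FROM t ORDER BY id")
    after_delete = cur.fetchall()

    # DROP TABLE
    cur.execute("DROP TABLE t")
    conn.commit()
    return after_insert, after_update, after_delete


if __name__ == "__main__":
    # sqlite3 is only a stand-in for whichever engine ends up being chosen.
    connection = sqlite3.connect(":memory:")
    for rows in run_crud(connection):
        print(rows)
    connection.close()
```

An engine that rejects the UPDATE or DELETE step here would fail the requirement as stated, which is one quick way to screen the candidates in the list.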