incubator-drill-user mailing list archives

From Neeraja Rentachintala <nrentachint...@maprtech.com>
Subject Re: Apache Drill Vs Spark SQL
Date Wed, 29 Oct 2014 18:18:22 GMT
Tridib

If you are getting started with Drill, you can also refer to a tutorial
that walks through Drill's various capabilities.
https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Tutorial

You are spot on about the metadata part. Discovering metadata dynamically and
providing the ability to work with complex data types such as JSON without
transformation is a key difference for Drill compared to Spark SQL and other
SQL options.
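
As a rough illustration of querying JSON in place, a Drill query over a raw
JSON file might look like the sketch below. The file path and field names are
hypothetical and not taken from this thread; the point is only that no table
definition or Hive metastore entry is created beforehand.

```sql
-- Hypothetical file and fields, for illustration only.
-- Drill reads the file directly through the dfs storage plugin and
-- discovers the fields at query time; nested fields are reached with
-- dot notation through a table alias.
SELECT t.customer.name         AS customer_name,
       t.customer.address.city AS city
FROM dfs.`/path/to/orders.json` t
WHERE t.order_status = 'shipped';
```

The table alias (`t` here) lets Drill distinguish a nested field reference
from a qualified table name.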

-Neeraja


On Wed, Oct 29, 2014 at 11:12 AM, Tridib Samanta <tridib.samanta@live.com>
wrote:

> Hi Adam,
> Thanks for sharing this! Apache Drill is very easy to get started with. I
> liked that Drill manages the metadata by itself and does not require Hive
> (unlike Spark).
>
> Thanks
> Tridib
>
> > Date: Wed, 29 Oct 2014 10:50:37 -0700
> > Subject: Re: Apache Drill Vs Spark SQL
> > From: adamphunt@gmail.com
> > To: drill-user@incubator.apache.org
> >
> > Hi Tridib,
> >
> > I just completed a simple evaluation of Drill 0.6.0 and Spark SQL 1.1.0. I
> > ran a few queries over 14 GB of Snappy-compressed Parquet files on a
> > four-server MapR cluster (96 cores, 256 GB).  Here are the results.
> >
> > Spark SQL requires some very minor setup, whereas Drill requires none.
> > val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> > val testData = sqlContext.parquetFile("/user/ahunt/test/2014/10/28/")
> > testData.registerTempTable("testData")
> >
> > In Drill, a simple count query took 19s the first time and 0.9s the second
> > time
> > SELECT count(*) FROM dfs.`/user/ahunt/test/2014/10/28/part-*`;
> >
> > In Spark SQL, it took 17s the first time and 1.7s the second
> > sqlContext.sql("SELECT count(*) FROM testData").collect().foreach(println)
> >
> > In Drill, a simple group by query printed the results, but would not return
> > to the prompt without hitting ctrl-c (after 6s).
> > SELECT httpResponseCode, count(*) FROM
> > dfs.`/user/ahunt/test/2014/10/28/part-*` GROUP BY httpResponseCode;
> >
> > In Spark SQL, it finished in 3.6s
> > sqlContext.sql("SELECT httpResponseCode,count(*) FROM testData GROUP BY httpResponseCode").collect().foreach(println)
> >
> > In Drill, this query never finished (probably due to the issue described
> > above).
> > SELECT httpResponseCode, count(*) FROM
> > dfs.`/user/ahunt/test/2014/10/28/` GROUP
> > BY httpResponseCode ORDER BY httpResponseCode DESC;
> >
> > In Spark SQL, the same query finished in 5s.
> > sqlContext.sql("SELECT httpResponseCode,count(*) FROM testData GROUP BY httpResponseCode ORDER BY httpResponseCode DESC").collect().foreach(println)
> >
> > Although Drill seems very promising, it seems that it has a few issues to
> > work out, and since I already use Spark I'm going to stick with Spark SQL
> > for now.
> >
> > Adam
> >
> >
> > On Wed, Oct 29, 2014 at 10:00 AM, Tridib Samanta <tridib.samanta@live.com>
> > wrote:
> >
> > > Hello Experts,
> > > I am new to Apache Drill. To me it looks very similar to Spark SQL. I was
> > > wondering how it differs from Spark SQL. What are the use cases where
> > > Apache Drill thrives compared to Spark SQL?
> > >
> > > Thanks & Regards
> > > Tridib
> > >
>
>
