drill-user mailing list archives

From Adam Hunt <adamph...@gmail.com>
Subject Re: Apache Drill Vs Spark SQL
Date Wed, 29 Oct 2014 21:53:20 GMT
Hi Tridib and Neeraja,

Although Spark SQL needs some boilerplate, it can discover the schema of
Parquet files just as Drill does.  You are correct that Hive and Impala still
require you to create a table.
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v1/latest/Installing-and-Using-Impala/ciiu_parquet.html

Adam

On Wed, Oct 29, 2014 at 11:18 AM, Neeraja Rentachintala <
nrentachintala@maprtech.com> wrote:

> Tridib
>
> If you are getting started with Drill, you can also refer to a tutorial
> that walks through various Drill capabilities.
> https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Tutorial
>
> You are spot on about the metadata part. Discovering metadata dynamically
> and providing the ability to work with complex data types such as JSON
> without transformation is a key difference for Drill compared to Spark SQL
> and other SQL options.
>
> -Neeraja
>
>
> On Wed, Oct 29, 2014 at 11:12 AM, Tridib Samanta <tridib.samanta@live.com>
> wrote:
>
> > Hi Adam,
> > Thanks for sharing this! Apache Drill is very easy to get started with. I
> > liked the part that Drill manages the metadata by itself and does not
> > require Hive (as Spark does).
> >
> > Thanks
> > Tridib
> >
> > > Date: Wed, 29 Oct 2014 10:50:37 -0700
> > > Subject: Re: Apache Drill Vs Spark SQL
> > > From: adamphunt@gmail.com
> > > To: drill-user@incubator.apache.org
> > >
> > > Hi Tridib,
> > >
> > > > I just completed a simple evaluation of Drill 0.6.0 and Spark SQL
> > > > 1.1.0.  I ran a few queries over 14GB of Snappy-compressed Parquet
> > > > files on a four-server MapR cluster (96 cores, 256 GB).  Here are the
> > > > results.
> > >
> > > > Spark SQL requires some very minor setup, whereas Drill doesn't.
> > > val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> > > val testData = sqlContext.parquetFile("/user/ahunt/test/2014/10/28/")
> > > testData.registerTempTable("testData")
> > >
> > > > In Drill, a simple count query took 19s the first time and 0.9s the
> > > > second time.
> > > SELECT count(*) FROM  dfs.`/user/ahunt/test/2014/10/28/part-*`;
> > >
> > > > In Spark SQL, it took 17s the first time and 1.7s the second.
> > > > sqlContext.sql("SELECT count(*) FROM testData").collect().foreach(println)
> > >
> > > > In Drill, a simple group by query printed the results, but would not
> > > > return to the prompt without hitting ctrl-c (after 6s).
> > > SELECT httpResponseCode, count(*) FROM
> > > dfs.`/user/ahunt/test/2014/10/28/part-*` GROUP BY httpResponseCode;
> > >
> > > > In Spark SQL, it finished in 3.6s.
> > > > sqlContext.sql("SELECT httpResponseCode, count(*) FROM testData GROUP BY httpResponseCode").collect().foreach(println)
> > >
> > > > In Drill, this query never finished (probably due to the issue
> > > > described above).
> > > > SELECT httpResponseCode, count(*) FROM
> > > > dfs.`/user/ahunt/test/2014/10/28/`
> > > > GROUP BY httpResponseCode ORDER BY httpResponseCode DESC;
> > >
> > > In Spark SQL, the same query finished in 5s.
> > > > sqlContext.sql("SELECT httpResponseCode, count(*) FROM testData GROUP BY httpResponseCode ORDER BY httpResponseCode DESC").collect().foreach(println)
> > >
> > > > Although Drill seems very promising, it seems that it has a few issues
> > > > to work out, and since I already use Spark I'm going to stick with
> > > > Spark SQL for now.
> > >
> > > Adam
> > >
> > >
> > > > On Wed, Oct 29, 2014 at 10:00 AM, Tridib Samanta
> > > > <tridib.samanta@live.com> wrote:
> > >
> > > > Hello Experts,
> > > > > I am new to Apache Drill. To me it seems very similar to Spark SQL.
> > > > > I was wondering how it differs from Spark SQL. What are the use
> > > > > cases where Apache Drill thrives compared to Spark SQL?
> > > >
> > > > Thanks & Regards
> > > > Tridib
> > > >
> >
> >
>
