drill-user mailing list archives

From Timothy Chen <tnac...@gmail.com>
Subject Re: Drill with Spark
Date Sat, 17 May 2014 14:43:29 GMT
Druid, just like Redshift, requires an extra ETL step to import the data before you can
query it, which slows down the freshness of your queryable data.

Obviously there are pros and cons to each decision, but Drill also tries to optimize
as much as possible with the metadata available, and down the road it will be able to
gather enough stats after a scan, or perhaps even through an extra compute-stats step
like what Impala does.

Tim

Sent from my iPhone

> On May 17, 2014, at 12:27 AM, Amit Matety <matety@yahoo.com> wrote:
> 
> Regarding the comparison: how does Drill compare to Druid, which is also an in-memory
> warehouse? Does Drill support joins to in-memory dimension tables, unlike Druid? Does
> it have any limitation on the number of records it can fetch, etc.?
> 
> Regards,
> Amit
> 
>> On May 16, 2014, at 8:46 PM, Jason Altekruse <altekrusejason@gmail.com> wrote:
>> 
>> Ted covered the most important points. I just want to add a few
>> clarifications.
>> 
>> While the code for Drill so far is written in pure Java, there is no
>> specific requirement that all of Drill run in Java. Part of the motivation
>> for the in-memory representation of records that we chose, making it
>> columnar and storing it in Java's native ByteBuffers, was to enable
>> integration with native code compiled from C/C++ to run some of our
>> operators. ByteBuffers are part of the official Java API, but their use is
>> not generally recommended: they allow memory operations that you do not
>> find in typical Java data types and structures, but require you to manage
>> your own memory.
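>> 
>> As a minimal sketch of the underlying idea (not Drill's actual
>> value-vector API, and the class name is made up for illustration), a
>> column of ints can live in a single off-heap direct buffer:
>> 
>>    import java.nio.ByteBuffer;
>> 
>>    public class IntColumn {
>>      private final ByteBuffer buf;
>> 
>>      public IntColumn(int capacity) {
>>        // Direct buffers live outside the Java heap, so the GC never
>>        // moves them and native code can address them directly.
>>        this.buf = ByteBuffer.allocateDirect(capacity * 4);
>>      }
>> 
>>      public void set(int index, int value) { buf.putInt(index * 4, value); }
>> 
>>      public int get(int index) { return buf.getInt(index * 4); }
>>    }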
>> 
>> One important use case for us is the ability to pass them through the Java
>> Native Interface without having to do a copy. While it would still be
>> inefficient to jump from Java to C for every record, we should be able to
>> define a clean interface that takes a batch of records (around 1000) to a
>> C context in a single jump; once the C code finishes processing them, the
>> single jump back into the Java context should complete just as quickly as
>> the jump in the other direction.
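>> 
>> In code, that boundary could look something like the following (the
>> library and method names here are hypothetical, not an actual Drill
>> interface):
>> 
>>    import java.nio.ByteBuffer;
>> 
>>    public class NativeOperator {
>>      static {
>>        System.loadLibrary("drillnative"); // hypothetical native library
>>      }
>> 
>>      // On the C side, GetDirectBufferAddress(env, batch) returns a raw
>>      // pointer to the same off-heap memory, so the whole batch of ~1000
>>      // records crosses the JNI boundary without any copy.
>>      public static native void processBatch(ByteBuffer batch, int recordCount);
>>    }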
>> 
>> With this consideration, any language you could pass data to from C would
>> be compatible. While we likely will not support a wide array of plugin
>> languages soon, it should be possible for people to plug in a variety of
>> existing codebases for adding data processing functionality to Drill.
>> 
>> -Jason Altekruse
>> 
>> 
>>> On Fri, May 16, 2014 at 8:11 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>> 
>>> Drill is a very different tool from Spark or even from Spark SQL (aka
>>> Shark).
>>> 
>>> There is some overlap, but there are important differences.  For instance,
>>> 
>>> - Drill supports weakly typed SQL (see the sketch after this list).
>>> 
>>> - Drill has a very clever way to pass data from one processor to another.
>>> This allows very efficient processing.
>>> 
>>> - Drill generates code in response to the query and to the observed data.
>>> This is a big deal since it allows high speed with dynamic types.
>>> 
>>> - Drill supports full ANSI SQL, not HiveQL.
>>> 
>>> - Spark supports programming in Scala.
>>> 
>>> - Spark ties distributed data objects to objects in a language like Java
>>> or Scala rather than using a columnar form.  This makes generic
>>> user-written code easier, but is less efficient.
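>>> 
>>> To make the weak typing concrete, here is a minimal sketch of querying
>>> Drill over JDBC.  The JSON file path and the column names a and b are
>>> assumptions for illustration, and zk=local is just one way to connect:
>>> 
>>>    import java.sql.Connection;
>>>    import java.sql.DriverManager;
>>>    import java.sql.ResultSet;
>>>    import java.sql.Statement;
>>> 
>>>    public class DrillQuery {
>>>      public static void main(String[] args) throws Exception {
>>>        Class.forName("org.apache.drill.jdbc.Driver");
>>>        // Drill infers the types of a and b from the data at run time;
>>>        // no schema or ETL step is declared up front.
>>>        try (Connection conn =
>>>                 DriverManager.getConnection("jdbc:drill:zk=local");
>>>             Statement stmt = conn.createStatement();
>>>             ResultSet rs = stmt.executeQuery(
>>>                 "SELECT a, b FROM dfs.`/tmp/events.json` WHERE b > 10")) {
>>>          while (rs.next()) {
>>>            System.out.println(rs.getString("a") + " " + rs.getLong("b"));
>>>          }
>>>        }
>>>      }
>>>    }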
>>> 
>>> 
>>> 
>>> 
>>> On Thu, May 15, 2014 at 9:41 AM, N.Venkata Naga Ravi
>>> <nvn_ravi@hotmail.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I started exploring Drill; it looks like a very interesting tool. Can
>>>> somebody explain how Drill compares with Apache Spark and Storm?
>>>> Do we still need Apache Spark along with Drill in the big data stack, or
>>>> can Drill directly replace Spark?
>>>> 
>>>> Thanks,
>>>> Ravi
>>> 
