incubator-drill-dev mailing list archives

From Constantine Peresypkin <pconstant...@gmail.com>
Subject Re: In-place processing and performance.
Date Wed, 19 Sep 2012 01:30:21 GMT
1. I don't see why the cache should be in columnar format. The only purpose of
the Dremel columnar format is to accelerate full table scans. That's it.
2. Scanners will be in C for performance reasons. The Dremel idea = scan
performance.

On Wed, Sep 19, 2012 at 12:58 AM, moon soo Lee <leemoonsoo@gmail.com> wrote:

> I agree: working version first, optimization later.
>
> Is there a good reason that many input scanners are expected to be in C?
>
>
>
> On Tue, Sep 18, 2012 at 12:11 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > I also generally agree, but I really think that we need a bit of
> > experience with a simple working version of Drill first.
> >
> > Also, anything like this is going to have to recognize that there are
> > likely to be multiple columnar formats and that some (many) input
> > scanners are going to be coded in C, not just Java.
> >
> > On Mon, Sep 17, 2012 at 7:51 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
> >
> > > Thanks!
> > >
> > > Generally agree, but cache and data manipulation should be separated.
> > > Every query reaches the cache first; on a miss, it calls the read-data
> > > interface, which must not be included in the cache module.
> > >
> > > That way everybody can replace the cache policy and the read/write
> > > implementations, and then configure drill.cache.policy.class,
> > > drill.read.class, and drill.write.class in the configuration file.
> > >
> > >
> > > On Tue, Sep 18, 2012 at 10:23 AM, moon soo Lee <leemoonsoo@gmail.com>
> > > wrote:
> > >
> > > > Here's my quick proposal for a common caching framework for Drill.
> > > >
> > > > 0. Why
> > > >
> > > >    - With in-place processing, the data is not guaranteed to be in the
> > > >    most efficient format to process (i.e. columnar).
> > > >    - A non-columnar format can have a huge performance impact (an order
> > > >    of magnitude).
> > > >
> > > >
> > > > 1. Goal.
> > > >
> > > >    - Increase performance without painful ETL
> > > >    - Performance includes not only overall throughput but also how
> > > >    interactive it is.
> > > >    - Provide an interface that is easy to implement from the
> > > >    datasource's point of view
> > > >
> > > >
> > > > 2. How does it look?
> > > >
> > > >    - Drill provides a common caching policy, which is responsible for
> > > >
> > > >       - constructing the columnar format
> > > >       - reading the columnar format
> > > >       - the caching algorithm
> > > >
> > > >
> > > >    - Each datasource optionally implements some methods to support
> > > >    caching. They could be
> > > >
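> > > >    import java.io.IOException;
> > > >    import java.io.InputStream;
> > > >    import java.io.OutputStream;
> > > >
> > > >    interface CachingSupport {
> > > >
> > > >    // to write columnar format data to cache media
> > > >    // (String as the path type is an assumption)
> > > >    OutputStream getOutputStream(String path) throws IOException;
> > > >
> > > >    // to clear cached data
> > > >    void remove(String path) throws IOException;
> > > >
> > > >    // to read cached data
> > > >    InputStream getInputStream(String path) throws IOException;
> > > >
> > > >    // to get location information of data (in DFS);
> > > >    // Location is a type not defined in this proposal
> > > >    Location getLocation(String path) throws IOException;
> > > >
> > > >    }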
> > > >
> > > >    - The datasource implementation does not care about the columnar
> > > >    format, the cache replacement policy, and so on; it only cares about
> > > >    basic IO. So people who implement a datasource do not need to
> > > >    understand the columnar details.
> > > >
> > > >
> > > > 3. How does it work?
> > > >
> > > >    - Drill constructs the columnar format cache using the methods
> > > >    provided by the datasource.
> > > >    - A datasource can skip the caching implementation. In that case,
> > > >    Drill works in passthrough mode.
> > > >    - The cache policy class can be replaced. So if a more efficient data
> > > >    format or algorithm comes along, it can be applied without changing
> > > >    every datasource implementation.
> > > >    - Cache construction does not block data reads, so the performance
> > > >    impact of cache construction is minimized.
> > > >    - Drill performs its queries through the cache. There could be some
> > > >    queries for cache management (like purge).
> > > >
> > > >
> > > >
> > > > Is it worth it, or does it just add complexity?
> > > >
> > > > For me, it's worth it: +1.
> > > >
> > > > And I'm fully ready to do this job. :-)
> > > >
> > > >
> > > > Thanks.
> > > >
> > > > ----
> > > >
> > > > Leemoonsoo
> > > > moon@nflabs.com
> > > >
> > > >
> > > > On Tue, Sep 18, 2012 at 1:59 AM, Tomer Shiran <tshiran@maprtech.com>
> > > > wrote:
> > > >
> > > > > The plan was to have the scan operator do that kind of caching, but I
> > > > > agree it could make sense to have some common caching framework in
> > > > > case other scan operators want to cache as well.
> > > > >
> > > > > On Sun, Sep 16, 2012 at 5:29 PM, moon soo Lee <moon@nflabs.com>
> > > > > wrote:
> > > > >
> > > > > > Drill wants in-place processing ([1], page 12). Yes, ETL is painful.
> > > > > > In my understanding, in-place processing means the data is not
> > > > > > always columnar.
> > > > > >
> > > > > > [2], Figure 10, shows the performance difference between columnar
> > > > > > and record-oriented (MR).
> > > > > > If Dremel works with record-oriented data, I would guess that it'll
> > > > > > be an order of magnitude slower.
> > > > > >
> > > > > > If that's true, will this still be interactive?
> > > > > >
> > > > > > And can anyone give more detail about "Adaptively convert storage
> > > > > > layout into more efficient forms" ([1], page 12)?
> > > > > > Is it a kind of transparent columnar format caching?
> > > > > >
> > > > > > And if non-columnar data is expected in many cases,
> > > > > > then how about Drill having a common cache for the storage interface
> > > > > > instead of each scanner implementing its own caching policy?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > [1] Apache Drill, Architecture outlines.
> > > > > > http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
> > > > > > [2] Dremel: Interactive Analysis of Web-Scale Datasets
> > > > > > http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Tomer Shiran
> > > > > Director of Product Management | MapR Technologies | 650-804-8657
> > > > >
> > > >
> > >
> >
>
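
The flow discussed in the thread (every query reaches the cache first, a miss falls back to a separate read interface, and a datasource without caching support runs in passthrough mode) could be sketched roughly as below. This is only an illustration under assumptions, not Drill code: `DataReader`, `InMemoryCachingSupport`, and `CachedScan` are hypothetical names, the `String` path type and the "null on a miss" convention are assumed, `getLocation` is left out, and the columnar conversion and non-blocking cache construction from the proposal are replaced by a plain inline byte copy.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Trimmed form of the interface proposed in the thread.
// Assumptions: String paths, and null from getInputStream on a miss.
interface CachingSupport {
    OutputStream getOutputStream(String path) throws IOException;
    InputStream getInputStream(String path) throws IOException;
    void remove(String path) throws IOException;
}

// Hypothetical read interface, kept outside the cache module.
interface DataReader {
    byte[] read(String path) throws IOException;
}

// Hypothetical in-memory cache medium, for illustration only.
class InMemoryCachingSupport implements CachingSupport {
    private final Map<String, byte[]> store = new HashMap<>();

    @Override
    public OutputStream getOutputStream(String path) {
        // The bytes become visible in the cache when the stream is closed.
        return new ByteArrayOutputStream() {
            @Override
            public void close() throws IOException {
                super.close();
                store.put(path, toByteArray());
            }
        };
    }

    @Override
    public InputStream getInputStream(String path) {
        byte[] data = store.get(path);
        return data == null ? null : new ByteArrayInputStream(data);
    }

    @Override
    public void remove(String path) {
        store.remove(path);
    }
}

// Hypothetical scan wrapper: cache first, read interface on a miss.
class CachedScan {
    private final DataReader reader;
    private final CachingSupport cache; // null means passthrough mode
    int sourceReads = 0;                // counts calls to the read interface

    CachedScan(DataReader reader, CachingSupport cache) {
        this.reader = reader;
        this.cache = cache;
    }

    byte[] scan(String path) throws IOException {
        if (cache != null) {
            InputStream cached = cache.getInputStream(path);
            if (cached != null) {
                try (InputStream in = cached) {
                    return in.readAllBytes(); // cache hit
                }
            }
        }
        sourceReads++;
        byte[] data = reader.read(path);
        if (cache != null) {
            // A real implementation would build a columnar layout here and
            // would not block the read; this sketch copies the bytes inline.
            try (OutputStream out = cache.getOutputStream(path)) {
                out.write(data);
            }
        }
        return data;
    }
}

class CachedScanDemo {
    public static void main(String[] args) throws IOException {
        DataReader reader = p -> ("rows of " + p).getBytes(StandardCharsets.UTF_8);
        CachedScan scan = new CachedScan(reader, new InMemoryCachingSupport());
        scan.scan("/logs/t1");
        scan.scan("/logs/t1");
        System.out.println("source reads with cache: " + scan.sourceReads);
    }
}
```

With the in-memory cache the second scan of the same path is served from the cache, so the read interface is called once; constructing `CachedScan` with a `null` cache gives the passthrough behaviour, where every scan goes to the read interface.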
