incubator-drill-dev mailing list archives

From Constantine Peresypkin <pconstant...@gmail.com>
Subject Re: In-place processing and performance.
Date Wed, 19 Sep 2012 04:40:49 GMT
> Columnar cache will make the next query fast.

Why is that? What is the difference between columnar cache and disk cache
then?

> Scanners will be in whatever language the authors write them in.

No problem with that, I've just explained why there will be C-scanners.

On Wed, Sep 19, 2012 at 7:35 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> On Tue, Sep 18, 2012 at 6:30 PM, Constantine Peresypkin <
> pconstantine@gmail.com> wrote:
>
> > 1. I don't see why cache should be in columnar format. The only purpose of
> > Dremel columnar format is to accelerate full table scans. That's it.
> >
>
> The cache is to make things fast.
>
> Columnar cache will make the next query fast.
>
>
> > 2. Scanners will be in C for performance reasons. Dremel idea = scan
> > performance.
> >
>
> Scanners will be in whatever language the authors write them in.  I think
> we need to preserve the option to write them in whatever language fits.
> Some serializations only have bindings in, say, Java.
>
>
>
> >
> > On Wed, Sep 19, 2012 at 12:58 AM, moon soo Lee <leemoonsoo@gmail.com>
> > wrote:
> >
> > > I agree: working version first, optimization later.
> > >
> > > Is there a good reason to expect many input scanners to be written in C?
> > >
> > >
> > >
> > > On Tue, Sep 18, 2012 at 12:11 PM, Ted Dunning <ted.dunning@gmail.com>
> > > wrote:
> > >
> > > > I also generally agree, but I really think that we need a bit of
> > > > experience with a simple working version of Drill first.
> > > >
> > > > Also, anything like this is going to have to recognize that there are
> > > > likely to be multiple columnar formats and that some (many) input
> > > > scanners are going to be coded in C, not just Java.
> > > >
> > > > On Mon, Sep 17, 2012 at 7:51 PM, Azuryy Yu <azuryyyu@gmail.com>
> wrote:
> > > >
> > > > > Thanks!
> > > > >
> > > > > Generally agree, but cache and data manipulation should be separated.
> > > > > Every query goes to the cache first; on a miss, it calls the read-data
> > > > > interface, which should not be part of the cache module.
> > > > >
> > > > > That way anyone can replace the cache policy and the read/write
> > > > > implementations by setting drill.cache.policy.class, drill.read.class,
> > > > > and drill.write.class in the configuration file.
> > > > >
> > > > >
> > > > > On Tue, Sep 18, 2012 at 10:23 AM, moon soo Lee <
> leemoonsoo@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > > Here's my quick proposal for a common caching framework for Drill.
> > > > > >
> > > > > > 0. Why
> > > > > >
> > > > > > >    - With in-place processing, the data format is not guaranteed to be
> > > > > > >    the most efficient one to process (i.e. columnar).
> > > > > > >    - A non-columnar format can have a huge performance impact (an order
> > > > > > >    of magnitude).
> > > > > >
> > > > > >
> > > > > > 1. Goal.
> > > > > >
> > > > > >    - Increase performance without painful ETL
> > > > > > >    - Performance includes not only overall throughput but also how
> > > > > > >    interactive it is.
> > > > > > >    - Provide an easy implementation interface from the datasource's
> > > > > > >    point of view
> > > > > >
> > > > > >
> > > > > > > 2. How does it look?
> > > > > > >
> > > > > > >    - Drill provides a common caching policy, which is responsible for:
> > > > > > >
> > > > > > >       - constructing the columnar format
> > > > > > >       - reading the columnar format
> > > > > > >       - the caching algorithm
> > > > > > >
> > > > > > >
> > > > > > >    - Each datasource optionally implements some methods to support
> > > > > > >    caching. They could be:
> > > > > >
> > > > > > >    interface CachingSupport {
> > > > > > >
> > > > > > >    // (path type assumed to be String)
> > > > > > >
> > > > > > >    // to write columnar format data to cache media
> > > > > > >    OutputStream getOutputStream(String path);
> > > > > > >
> > > > > > >    // to clear cached data
> > > > > > >    void remove(String path);
> > > > > > >
> > > > > > >    // to read cached data
> > > > > > >    InputStream getInputStream(String path);
> > > > > > >
> > > > > > >    // to get location information of data (in DFS)
> > > > > > >    Location getLocation(String path);
> > > > > > >
> > > > > > >    }
> > > > > >
> > > > > > >    - The datasource implementation does not care about the columnar
> > > > > > >    format, the cache replacement policy, and so on; it only cares about
> > > > > > >    basic IO. So people who implement a datasource do not need to
> > > > > > >    understand the columnar details.
> > > > > >
> > > > > >
> > > > > > > 3. How does it work?
> > > > > > >
> > > > > > >    - Drill constructs the columnar-format cache using the methods the
> > > > > > >    datasource provides.
> > > > > > >    - A datasource can skip the caching implementation. In that case,
> > > > > > >    Drill works in passthrough mode.
> > > > > > >    - The cache policy class can be replaced. So if a more efficient data
> > > > > > >    format or algorithm appears, it can be applied without changing
> > > > > > >    every datasource implementation.
> > > > > > >    - Cache construction does not block data reads, so the performance
> > > > > > >    impact of cache construction is minimized.
> > > > > > >    - Drill performs its queries through the cache. There could be some
> > > > > > >    queries for cache management (like purge).
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Is it worth it, or does it just add complexity?
> > > > > > >
> > > > > > > For me it's worth it: +1.
> > > > > > >
> > > > > > > And I'm fully ready to do this job. :-)
> > > > > >
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > ----
> > > > > >
> > > > > > Leemoonsoo
> > > > > > moon@nflabs.com
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 18, 2012 at 1:59 AM, Tomer Shiran <
> > tshiran@maprtech.com>
> > > > > > wrote:
> > > > > >
> > > > > > > > The plan was to have the scan operator do that kind of caching, but I
> > > > > > > > agree it could make sense to have some common caching framework in
> > > > > > > > case other scan operators want to cache as well.
> > > > > > >
> > > > > > > On Sun, Sep 16, 2012 at 5:29 PM, moon soo Lee <moon@nflabs.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > > Drill wants in-place processing ([1], page 12). Yes, ETL is painful.
> > > > > > > > > In my understanding, in-place processing means the data is not
> > > > > > > > > always columnar.
> > > > > > > > >
> > > > > > > > > [2], Figure 10, shows the performance difference between columnar and
> > > > > > > > > record-oriented (MR).
> > > > > > > > > If Dremel works with record-oriented data, I would guess it will be
> > > > > > > > > an order of magnitude slower.
> > > > > > > > >
> > > > > > > > > If that's true, will this still be interactive?
> > > > > > > > >
> > > > > > > > > And can anyone give more detail about "Adaptively convert storage
> > > > > > > > > layout into more efficient forms" ([1], page 12)?
> > > > > > > > > Is it a kind of transparent columnar-format caching?
> > > > > > > > >
> > > > > > > > > And if non-columnar data is expected in many cases, how about Drill
> > > > > > > > > having a common cache for the storage interface, instead of each
> > > > > > > > > scanner implementing its own caching policy?
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > > [1] Apache Drill, Architecture outlines.
> > > > > > > > > http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
> > > > > > > > > [2] Dremel: Interactive Analysis of Web-Scale Datasets
> > > > > > > > > http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Tomer Shiran
> > > > > > > Director of Product Management | MapR Technologies |
> > 650-804-8657
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
