drill-dev mailing list archives

From Steven Phillips <ste...@dremio.com>
Subject Re: [DISCUSS] Ideas to improve metadata cache read performance
Date Fri, 30 Oct 2015 22:22:00 GMT
My view on storing it in some other format is that, yes, it will probably
reduce the size of the file, but if we gzip the json file, it should be
pretty compact. As for deserialization cost, other formats would be faster,
but not dramatically faster. Certainly not the order of magnitude faster
that we really need it to be. The reason we chose JSON was because it is
readable and easier to deal with.
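
To make the gzip point concrete, here is a rough sketch of how one could
compare the plain and gzipped sizes of a JSON metadata file. The structure
and field names (files, rowGroups, dir0, and so on) are made up for
illustration and are not the actual cache format; the helper names are
hypothetical too.

```python
import gzip
import json


def make_fake_metadata(num_files):
    """Build a hypothetical metadata structure (field names are illustrative only)."""
    return {
        "files": [
            {
                "path": f"/data/t/part-{i:05d}.parquet",
                "length": 1024 * 1024,
                "rowGroups": [
                    {
                        "start": 4,
                        "rowCount": 1000,
                        "columns": [
                            {"name": "dir0", "min": "2015", "max": "2015", "nulls": 0}
                        ],
                    }
                ],
            }
            for i in range(num_files)
        ]
    }


def json_and_gzip_sizes(metadata):
    """Return (plain JSON size, gzipped JSON size) in bytes."""
    raw = json.dumps(metadata).encode("utf-8")
    packed = gzip.compress(raw)
    return len(raw), len(packed)


def roundtrip(metadata):
    """Serialize to gzipped JSON and read it back, as a reader of the cache would."""
    packed = gzip.compress(json.dumps(metadata).encode("utf-8"))
    return json.loads(gzip.decompress(packed).decode("utf-8"))
```

Note that this only addresses file size. The reader still has to parse the
full JSON after decompressing it, so it does not help with deserialization
cost.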

As for the old code, I can point you at a branch, but it's probably not
very helpful. Unless we want to essentially disable value-based partition
pruning when using the cache, the old code will not work.

My recommendation would be to come up with a new version of the format
which stores only the name and value of columns which are single-valued for
each file or row group. This will allow partition pruning to work, but some
count queries may be slower, because the cache will no longer have column
value counts on a per-rowgroup basis.
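
A minimal sketch of that idea, assuming per-column min/max/null-count stats
are available when the cache is written. The dictionary layout and the
function names are hypothetical, and treating a column as single-valued only
when min equals max and there are no nulls is my assumption, not something
the current code is confirmed to do.

```python
def summarize_row_group(column_stats):
    """Keep only the columns that hold a single value in this row group.

    column_stats maps column name -> {"min": ..., "max": ..., "nulls": ...}.
    Assumption: single-valued means min == max and no nulls.
    """
    return {
        name: stats["min"]
        for name, stats in column_stats.items()
        if stats["min"] is not None
        and stats["min"] == stats["max"]
        and stats.get("nulls", 0) == 0
    }


def prune(row_groups, column, value):
    """Return the paths of row groups that may contain rows with column == value."""
    kept = []
    for rg in row_groups:
        single = rg["single_valued"]
        # A row group can be skipped only if the column is single-valued there
        # and that value differs from the one requested.
        if column not in single or single[column] == value:
            kept.append(rg["path"])
    return kept
```

With only this summary stored, the pruning decision still works, but the
per-column value counts are gone, which is why count queries that were
answered from the cache would have to go back to the files.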

Anyway, here is the link to the original branch.

https://github.com/StevenMPhillips/drill/tree/meta

On Fri, Oct 30, 2015 at 3:01 PM, Parth Chandra <parthc@apache.org> wrote:

> Hey Jacques, Steven,
>
>   Do we have a branch somewhere which has the initial prototype code? I'd
> like to prune the file a bit as it looks like reducing the size of the
> metadata cache file might yield the best results.
>
>   Also, did we have a particular reason for going with JSON as opposed to a
> more compact binary format? Are there any arguments against saving this as
> a protobuf/BSON/Parquet file?
>
> Parth
>
> On Mon, Oct 26, 2015 at 2:42 PM, Jacques Nadeau <jacques@dremio.com> wrote:
>
> > My first thought is we've gotten too generous in what we're storing in
> > the Parquet metadata file. Early implementations were very lean and it
> > seems far larger today. For example, early implementations didn't keep
> > statistics and ignored row groups (files, schema and block locations
> > only). If we need multiple levels of information, we may want to stagger
> > (or normalize) them in the file. Also, we may think about what is the
> > minimum that must be done in planning. We could do the file pruning at
> > execution time rather than single-tracking these things (makes stats
> > harder though).
> >
> > I also think we should be cautious around jumping to a conclusion until
> > DRILL-3973 provides more insight.
> >
> > In terms of caching, I'd be more inclined to rely on file system caching
> > and make sure serialization/deserialization is as efficient as possible
> > as opposed to implementing an application-level cache. (We already have
> > enough problems managing memory without having to figure out when we
> > should drop a metadata cache :D).
> >
> > Aside, I always liked this post for entertainment and the thoughts on
> > virtual memory: https://www.varnish-cache.org/trac/wiki/ArchitectNotes
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Mon, Oct 26, 2015 at 2:25 PM, Hanifi Gunes <hgunes@maprtech.com> wrote:
> >
> > > One more thing, for workloads running queries over subsets of same
> > > parquet files, we can consider maintaining an in-memory cache as well.
> > > Assuming metadata memory footprint per file is low and parquet files
> > > are static, not needing us to invalidate the cache often.
> > >
> > > H+
> > >
> > > On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes <hgunes@maprtech.com> wrote:
> > >
> > > > I am not familiar with the contents of metadata stored but if
> > > > deserialization workload seems to be fitting to any of afterburner's
> > > > claimed improvement points [1] It could well be worth trying given
> > > > the claimed gain on throughput is substantial.
> > > >
> > > > It could also be a good idea to partition caching over a number of
> > > > files for better parallelization given number of cache files generated
> > > > is *significantly* less than number of parquet files. Maintaining
> > > > global statistics seems an improvement point too.
> > > >
> > > >
> > > > -H+
> > > >
> > > > 1: https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
> > > >
> > > > On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <amansinha@apache.org> wrote:
> > > >
> > > >> Forgot to include the link for Jackson's AfterBurner module:
> > > >>   https://github.com/FasterXML/jackson-module-afterburner
> > > >>
> > > >> On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <amansinha@apache.org> wrote:
> > > >>
> > > >> > I was going to file an enhancement JIRA but thought I will discuss
> > > >> > here first:
> > > >> >
> > > >> > The parquet metadata cache file is a JSON file that contains a
> > > >> > subset of the metadata extracted from the parquet files.  The cache
> > > >> > file can get really large .. a few GBs for a few hundred thousand
> > > >> > files.
> > > >> > I have filed a separate JIRA: DRILL-3973 for profiling the various
> > > >> > aspects of planning including metadata operations.  In the
> > > >> > meantime, the timestamps in the drillbit.log output indicate a
> > > >> > large chunk of time spent in creating the drill table to begin
> > > >> > with, which indicates bottleneck in reading the metadata.  (I can
> > > >> > provide performance numbers later once we confirm through
> > > >> > profiling).
> > > >> >
> > > >> > A few thoughts around improvements:
> > > >> >  - The jackson deserialization of the JSON file is very slow.. can
> > > >> >    this be speeded up ? .. for instance the AfterBurner module of
> > > >> >    jackson claims to improve performance by 30-40% by avoiding the
> > > >> >    use of reflection.
> > > >> >  - The cache file read is a single threaded process.  If we were
> > > >> >    directly reading from parquet files, we use a default of 16
> > > >> >    threads.  What can be done to parallelize the read ?
> > > >> >  - Any operation that can be done one time during the REFRESH
> > > >> >    METADATA command ?  for instance..examining the min/max values
> > > >> >    to determine single-value for partition column could be
> > > >> >    eliminated if we do this computation during REFRESH METADATA
> > > >> >    command and store the summary one time.
> > > >> >
> > > >> >  - A pertinent question is: should the cache file be stored in a
> > > >> >    more efficient format such as Parquet instead of JSON ?
> > > >> >
> > > >> > Aman
> > > >> >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
