drill-dev mailing list archives

From: Jason Altekruse <altekruseja...@gmail.com>
Subject: Re: [DISCUSS] Ideas to improve metadata cache read performance
Date: Thu, 29 Oct 2015 18:55:35 GMT
Overall, the best test for this feature would be a performance comparison
between a query that can prune and one that cannot; no plan verification is
really needed.  This captures any increase in planning or execution time,
albeit not isolated by cause, and query profiles can tell us where any
increase in total query time has crept in.
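As a rough illustration of that comparison, the two queries below could be timed side by side through Drill's JDBC driver.  This is only a sketch: the connection URL, table path, and the dir0 predicate are placeholders to adapt to whatever cluster and data layout is under test; only the relative timings matter.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PruningTimingCheck {
        // Placeholder connection string and table path; assumes the Drill JDBC driver is on the classpath.
        private static final String URL = "jdbc:drill:zk=localhost:2181";
        private static final String PRUNABLE =
            "SELECT count(*) FROM dfs.`/data/table` WHERE dir0 = '20150701'";
        private static final String NON_PRUNABLE =
            "SELECT count(*) FROM dfs.`/data/table` WHERE some_col = 'x'";

        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(URL);
                 Statement stmt = conn.createStatement()) {
                System.out.println("prunable:     " + timeQuery(stmt, PRUNABLE) + " ms");
                System.out.println("non-prunable: " + timeQuery(stmt, NON_PRUNABLE) + " ms");
            }
        }

        private static long timeQuery(Statement stmt, String sql) throws Exception {
            long start = System.currentTimeMillis();
            try (ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) { /* drain the single count(*) row */ }
            }
            return System.currentTimeMillis() - start;
        }
    }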

On Thu, Oct 29, 2015 at 11:33 AM, Steven Phillips <steven@dremio.com> wrote:

> I agree that this would present a small challenge for testing, but I don't
> think ease of testing should be the primary motivator in designing the
> software. Once we've decided what we want the software to do, then we can
> work together to figure out how to test it.
>
> On Thu, Oct 29, 2015 at 11:09 AM, rahul challapalli <
> challapallirahul@gmail.com> wrote:
>
> > @Steven, if we end up pushing the partition pruning to the execution phase,
> > how would we know that partition pruning even took place?  I am thinking
> > from the standpoint of adding functional tests around partition pruning.
> >
> > - Rahul
> >
> > On Wed, Oct 28, 2015 at 10:53 AM, Parth Chandra <parthc@apache.org> wrote:
> >
> > > And ideally, I suppose, the merged schema would correspond to the
> > > information that we want to keep in a .drill file.
> > >
> > >
> > > On Tue, Oct 27, 2015 at 4:55 PM, Aman Sinha <asinha@maprtech.com> wrote:
> > >
> > > > @Steven, w.r.t. your suggestion about doing the metadata operation during
> > > > the execution phase, see the related discussion in DRILL-3838.
> > > >
> > > > A couple more thoughts:
> > > >  - Parth and I were discussing keeping track of the merged schema as part
> > > > of the refresh metadata and storing the merged schema once for all files
> > > > that have an identical schema (currently this is repeated and is a huge
> > > > contributor to the size of the file).  To Jacques' point about keeping the
> > > > minimum information needed for planning purposes, we certainly could do a
> > > > better job of keeping it lean.  The row count of the table could be
> > > > computed at the time of running the refresh metadata command.  Similarly,
> > > > the analysis of single-value columns can be done at that time instead of
> > > > on a per-query basis.  (A sketch of one possible normalized layout appears
> > > > below.)
> > > >
> > > >  - We should revisit DRILL-2517 (
> > > > https://issues.apache.org/jira/browse/DRILL-2517).
> > > >   Consider the following 2 queries and their total elapsed times against a
> > > > table with 310000 files:
> > > >     (A) SELECT count(*) FROM table WHERE `date` = '2015-07-01';
> > > >           elapsed time: 980 secs
> > > >
> > > >     (B) SELECT count(*) FROM `table/20150701`;
> > > >           elapsed time: 54 secs
> > > >
> > > >     From the user perspective, both queries should perform nearly the same,
> > > > which was essentially the intent of DRILL-2517.
> > > >
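A minimal sketch of the normalized cache layout discussed above: each distinct schema is stored once and files refer to it by index, and the table row count and single-value summary are precomputed during the refresh.  The class and field names here are hypothetical stand-ins, not Drill's actual metadata classes.

    import java.util.List;
    import java.util.Map;

    // Hypothetical, simplified cache layout for illustration only.
    public class NormalizedCacheLayout {

        public static class ColumnSchema {
            public String name;
            public String type;                       // e.g. "INT64", "VARCHAR"
        }

        public static class FileMetadata {
            public String path;
            public int schemaIndex;                   // index into distinctSchemas
            public long rowCount;
            public Map<String, String> singleValueColumns;  // column -> its only value, if any
        }

        public static class TableMetadata {
            public List<List<ColumnSchema>> distinctSchemas;  // each distinct schema stored once
            public List<FileMetadata> files;
            public long totalRowCount;                // computed once at refresh time
        }
    }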
> > > >
> > > > On Tue, Oct 27, 2015 at 12:04 PM, Steven Phillips <steven@dremio.com> wrote:
> > > >
> > > > > I think we need to come up with a way to push partition pruning to
> > > > > execution time.  The other solutions may relieve the problem in some
> > > > > cases, but won't solve the fundamental problem.
> > > > >
> > > > > For example, even if we do figure out how to use multiple threads for
> > > > > reading the metadata, that may be fine for a couple hundred thousand
> > > > > files, but what about when we have millions or tens of millions of files?
> > > > > It will still be a huge bottleneck.
> > > > >
> > > > > I actually think we should use the Drill execution engine to probe the
> > > > > metadata and generate the work assignments.  We could have an additional
> > > > > fragment or fragments of the query that would recursively probe the
> > > > > filesystem, read the metadata, and make assignments, and then pipe the
> > > > > results into the Scanners, which will create readers on the fly.  This way
> > > > > the query could actually begin doing work before the metadata has even
> > > > > been fully read.
> > > > >
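The pipelining idea above, reduced to a generic sketch in plain Java rather than Drill's fragment/Scanner machinery: a listing thread walks the directory tree and feeds paths into a queue while a pool of reader tasks starts consuming them before the full listing has finished.  The directory path and the thread count of 16 are arbitrary placeholders.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.stream.Stream;

    public class PipelinedScanSketch {
        private static final Path END = Paths.get("");   // sentinel marking the end of the listing
        private static final int READERS = 16;

        public static void main(String[] args) throws Exception {
            BlockingQueue<Path> queue = new LinkedBlockingQueue<>(1024);
            ExecutorService pool = Executors.newFixedThreadPool(READERS);

            // Producer: hand off parquet files as soon as they are discovered.
            Thread lister = new Thread(() -> {
                try (Stream<Path> paths = Files.walk(Paths.get("/data/table"))) {
                    paths.filter(p -> p.toString().endsWith(".parquet"))
                         .forEach(p -> put(queue, p));
                } catch (IOException e) {
                    e.printStackTrace();
                } finally {
                    for (int i = 0; i < READERS; i++) {
                        put(queue, END);                  // one sentinel per reader
                    }
                }
            });
            lister.start();

            // Consumers: begin "reading" while the listing is still in progress.
            for (int i = 0; i < READERS; i++) {
                pool.submit(() -> {
                    Path p;
                    while ((p = queue.take()) != END) {
                        // A real engine would create a reader / work assignment here.
                        System.out.println("would read " + p);
                    }
                    return null;
                });
            }
            pool.shutdown();
        }

        private static void put(BlockingQueue<Path> queue, Path p) {
            try {
                queue.put(p);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }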
> > > > > On Mon, Oct 26, 2015 at 2:42 PM, Jacques Nadeau <jacques@dremio.com> wrote:
> > > > >
> > > > > > My first thought is we've gotten too generous in what we're storing in
> > > > > > the Parquet metadata file.  Early implementations were very lean, and it
> > > > > > seems far larger today.  For example, early implementations didn't keep
> > > > > > statistics and ignored row groups (files, schema and block locations
> > > > > > only).  If we need multiple levels of information, we may want to
> > > > > > stagger (or normalize) them in the file.  Also, we may think about what
> > > > > > is the minimum that must be done in planning.  We could do the file
> > > > > > pruning at execution time rather than single-tracking these things
> > > > > > (makes stats harder though).
> > > > > >
> > > > > > I also think we should be cautious about jumping to a conclusion until
> > > > > > DRILL-3973 provides more insight.
> > > > > >
> > > > > > In terms of caching, I'd be more inclined to rely on file system caching
> > > > > > and make sure serialization/deserialization is as efficient as possible,
> > > > > > as opposed to implementing an application-level cache.  (We already have
> > > > > > enough problems managing memory without having to figure out when we
> > > > > > should drop a metadata cache :D).
> > > > > >
> > > > > > Aside, I always liked this post for entertainment and the thoughts on
> > > > > > virtual memory: https://www.varnish-cache.org/trac/wiki/ArchitectNotes
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jacques Nadeau
> > > > > > CTO and Co-Founder, Dremio
> > > > > >
> > > > > > On Mon, Oct 26, 2015 at 2:25 PM, Hanifi Gunes <hgunes@maprtech.com> wrote:
> > > > > >
> > > > > > > One more thing: for workloads running queries over subsets of the
> > > > > > > same parquet files, we can consider maintaining an in-memory cache as
> > > > > > > well, assuming the metadata memory footprint per file is low and the
> > > > > > > parquet files are static, so we would not need to invalidate the cache
> > > > > > > often.
> > > > > > >
> > > > > > > H+
> > > > > > >
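A small sketch of such an in-memory cache using Guava, which Drill already depends on.  The ParquetTableMetadata placeholder, the loader body, and the size/expiry limits are assumptions for illustration; a real version would also need a story for invalidating entries when files change.

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;
    import java.util.concurrent.TimeUnit;

    public class MetadataCacheHolder {

        // Placeholder for whatever object the deserialized metadata cache file becomes.
        public static class ParquetTableMetadata { }

        // Keyed by cache-file path; bounded by size and idle time so memory stays predictable.
        private static final LoadingCache<String, ParquetTableMetadata> CACHE =
            CacheBuilder.newBuilder()
                .maximumSize(1000)
                .expireAfterAccess(30, TimeUnit.MINUTES)
                .build(new CacheLoader<String, ParquetTableMetadata>() {
                    @Override
                    public ParquetTableMetadata load(String path) {
                        // The real code would deserialize the cache file at 'path' here.
                        return new ParquetTableMetadata();
                    }
                });

        public static ParquetTableMetadata get(String path) throws Exception {
            return CACHE.get(path);   // loads and caches on first access
        }
    }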
> > > > > > > On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes <hgunes@maprtech.com> wrote:
> > > > > > >
> > > > > > > > I am not familiar with the contents of the stored metadata, but if
> > > > > > > > the deserialization workload fits any of Afterburner's claimed
> > > > > > > > improvement points [1], it could well be worth trying, given that
> > > > > > > > the claimed gain in throughput is substantial.
> > > > > > > >
> > > > > > > > It could also be a good idea to partition the cache over a number of
> > > > > > > > files for better parallelization, given that the number of cache
> > > > > > > > files generated is *significantly* less than the number of parquet
> > > > > > > > files.  Maintaining global statistics seems like an improvement
> > > > > > > > point too.
> > > > > > > >
> > > > > > > > -H+
> > > > > > > >
> > > > > > > > 1: https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
> > > > > > > >
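For reference, wiring Afterburner into Jackson is a one-line change on the ObjectMapper; the MetadataCacheFile class below is just a stand-in for whatever POJO the cache file is actually bound to.

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.module.afterburner.AfterburnerModule;
    import java.io.File;

    public class AfterburnerRead {

        // Stand-in for the POJO(s) the metadata cache file maps to.
        public static class MetadataCacheFile { }

        public static MetadataCacheFile read(File cacheFile) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            // Afterburner swaps much of Jackson's reflection-based property access
            // for generated bytecode; nothing else about the mapper changes.
            mapper.registerModule(new AfterburnerModule());
            return mapper.readValue(cacheFile, MetadataCacheFile.class);
        }
    }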
> > > > > > > > On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <amansinha@apache.org> wrote:
> > > > > > > >
> > > > > > > >> Forgot to include the link for Jackson's AfterBurner module:
> > > > > > > >>   https://github.com/FasterXML/jackson-module-afterburner
> > > > > > > >>
> > > > > > > >> On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <amansinha@apache.org> wrote:
> > > > > > > >>
> > > > > > > >> > I was going to file an enhancement JIRA but thought I would
> > > > > > > >> > discuss here first:
> > > > > > > >> >
> > > > > > > >> > The parquet metadata cache file is a JSON file that contains a
> > > > > > > >> > subset of the metadata extracted from the parquet files.  The
> > > > > > > >> > cache file can get really large .. a few GBs for a few hundred
> > > > > > > >> > thousand files.
> > > > > > > >> > I have filed a separate JIRA, DRILL-3973, for profiling the
> > > > > > > >> > various aspects of planning, including metadata operations.  In
> > > > > > > >> > the meantime, the timestamps in the drillbit.log output indicate
> > > > > > > >> > a large chunk of time spent in creating the drill table to begin
> > > > > > > >> > with, which points to a bottleneck in reading the metadata.  (I
> > > > > > > >> > can provide performance numbers later once we confirm through
> > > > > > > >> > profiling.)
> > > > > > > >> >
> > > > > > > >> > A few thoughts around improvements:
> > > > > > > >> >  - The jackson deserialization of the JSON file is very slow ..
> > > > > > > >> > can this be sped up?  For instance, the AfterBurner module of
> > > > > > > >> > jackson claims to improve performance by 30-40% by avoiding the
> > > > > > > >> > use of reflection.
> > > > > > > >> >  - The cache file read is a single-threaded process.  If we are
> > > > > > > >> > directly reading from parquet files, we use a default of 16
> > > > > > > >> > threads.  What can be done to parallelize the read?  (A sketch of
> > > > > > > >> > one possible approach appears at the end of this thread.)
> > > > > > > >> >  - Any operation that can be done one time during the REFRESH
> > > > > > > >> > METADATA command?  For instance, examining the min/max values to
> > > > > > > >> > determine single-value for a partition column could be eliminated
> > > > > > > >> > if we do this computation during the REFRESH METADATA command and
> > > > > > > >> > store the summary one time.  (A sketch follows this message.)
> > > > > > > >> >
> > > > > > > >> >  - A pertinent question is: should the cache file be stored in a
> > > > > > > >> > more efficient format such as Parquet instead of JSON?
> > > > > > > >> >
> > > > > > > >> > Aman
> > > > > > > >> >
> > > > > > > >> >
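A sketch of the single-value computation mentioned in the third bullet above, done once at refresh time instead of per query: a column is marked single-valued when its min equals its max in every file.  The statistics classes are hypothetical stand-ins, not Drill's own metadata types.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SingleValueSummary {

        // Hypothetical per-file, per-column statistics as they might appear in the cache.
        public static class ColumnStats {
            public String column;
            public Object min;
            public Object max;
        }

        // Run once while refreshing metadata; the result is stored in the cache so the
        // planner never re-derives it per query.  A column qualifies only if it holds a
        // single value (min == max, non-null) in every file of the table.
        public static Map<String, Boolean> summarize(List<ColumnStats> statsForAllFiles) {
            Map<String, Boolean> singleValued = new HashMap<>();
            for (ColumnStats stats : statsForAllFiles) {
                boolean single = stats.min != null && stats.min.equals(stats.max);
                singleValued.merge(stats.column, single, Boolean::logicalAnd);
            }
            return singleValued;
        }
    }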
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
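Finally, combining two suggestions from the thread, splitting the metadata cache into several chunk files and parallelizing the read, a multi-threaded deserialization pass could look roughly like this.  The chunk layout, the MetadataChunk class, and the pool size of 16 (mirroring the default mentioned for direct parquet reads) are all assumptions.

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelCacheRead {

        // Stand-in for the POJO a single cache chunk deserializes into.
        public static class MetadataChunk { }

        public static List<MetadataChunk> readAll(List<File> chunkFiles) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(16);
            // ObjectMapper is thread-safe once configured; the Afterburner module from the
            // earlier sketch could be registered here as well.
            ObjectMapper mapper = new ObjectMapper();
            try {
                List<Future<MetadataChunk>> futures = new ArrayList<>();
                for (File chunk : chunkFiles) {
                    futures.add(pool.submit((Callable<MetadataChunk>) () ->
                        mapper.readValue(chunk, MetadataChunk.class)));
                }
                List<MetadataChunk> chunks = new ArrayList<>();
                for (Future<MetadataChunk> f : futures) {
                    chunks.add(f.get());                  // surfaces any per-chunk failure
                }
                return chunks;
            } finally {
                pool.shutdown();
            }
        }
    }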
