drill-dev mailing list archives

From Steven Phillips <ste...@dremio.com>
Subject Re: [DISCUSS] Ideas to improve metadata cache read performance
Date Thu, 29 Oct 2015 18:33:30 GMT
I agree that this would present a small challenge for testing, but I don't
think ease of testing should be the primary motivator in designing the
software. Once we've decided what we want the software to do, then we can
work together to figure out how to test it.

On Thu, Oct 29, 2015 at 11:09 AM, rahul challapalli
<challapallirahul@gmail.com> wrote:

> @steven If we end up pushing partition pruning to the execution
> phase, how would we know that partition pruning even took place? I am
> asking from the standpoint of adding functional tests around
> partition pruning.
>
> - Rahul
>
> On Wed, Oct 28, 2015 at 10:53 AM, Parth Chandra <parthc@apache.org> wrote:
>
> > And ideally, I suppose, the merged schema would correspond to the
> > information that we want to keep in a .drill file.
> >
> >
> > On Tue, Oct 27, 2015 at 4:55 PM, Aman Sinha <asinha@maprtech.com> wrote:
> >
> > > @Steven, w.r.t. your suggestion about doing the metadata operation
> > > during the execution phase, see the related discussion in DRILL-3838.
> > >
> > > A couple more thoughts:
> > >  - Parth and I were discussing keeping track of the merged schema
> > > as part of the refresh metadata, and storing the merged schema once
> > > for all files that have an identical schema (currently this is
> > > repeated per file and is a huge contributor to the size of the
> > > cache file). To Jacques' point about keeping the minimum
> > > information needed for planning purposes, we could certainly do a
> > > better job of keeping it lean. The row count of the table could be
> > > computed at the time of running the REFRESH METADATA command.
> > > Similarly, the single-value analysis could be done at that time
> > > instead of on a per-query basis. (A sketch of the schema
> > > de-duplication idea follows this message.)
> > >
> > >  - We should revisit DRILL-2517 (
> > > https://issues.apache.org/jira/browse/DRILL-2517).
> > > Consider the following 2 queries and their total elapsed times
> > > against a table with 310000 files:
> > >
> > >     (A) SELECT count(*) FROM table WHERE `date` = '2015-07-01';
> > >           elapsed time: 980 secs
> > >
> > >     (B) SELECT count(*) FROM `table/20150701`;
> > >           elapsed time: 54 secs
> > >
> > > From the user perspective, both queries should perform nearly the
> > > same, which was essentially the intent of DRILL-2517.
> > >
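A minimal sketch of the schema de-duplication idea above, using
hypothetical names rather than Drill's actual metadata classes: store
each distinct schema once and have the per-file entries carry only an
integer reference.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: intern each distinct serialized schema so the
    // cache file stores it once; file entries keep only the index.
    public class SchemaDedupSketch {
      private final List<String> distinctSchemas = new ArrayList<>();
      private final Map<String, Integer> schemaToIndex = new HashMap<>();

      // Returns a stable index for the schema, adding it only if unseen.
      public int intern(String serializedSchema) {
        Integer idx = schemaToIndex.get(serializedSchema);
        if (idx == null) {
          idx = distinctSchemas.size();
          distinctSchemas.add(serializedSchema);
          schemaToIndex.put(serializedSchema, idx);
        }
        return idx;
      }
    }

With a few hundred thousand files sharing a handful of schemas, the
per-file entries shrink from a full schema copy to a single integer each.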
> > >
> > > On Tue, Oct 27, 2015 at 12:04 PM, Steven Phillips <steven@dremio.com>
> > > wrote:
> > >
> > > > I think we need to come up with a way to push partition pruning
> > > > to execution time. The other solutions may relieve the problem in
> > > > some cases, but won't solve the fundamental problem.
> > > >
> > > > For example, even if we do figure out how to use multiple threads
> > > > for reading the metadata, that may be fine for a couple hundred
> > > > thousand files, but what about when we have millions or tens of
> > > > millions of files? It will still be a huge bottleneck.
> > > >
> > > > I actually think we should use the Drill execution engine to
> > > > probe the metadata and generate the work assignments. We could
> > > > have an additional fragment or fragments of the query that would
> > > > recursively probe the filesystem, read the metadata, and make
> > > > assignments, and then pipe the results into the Scanners, which
> > > > will create readers on the fly. This way the query could actually
> > > > begin doing work before the metadata has even been fully read.
> > > >
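As a rough illustration of the pipelined approach described above, a
sketch using plain Java threads; the real implementation would use Drill
fragments and Scanners, so all names here are hypothetical.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch: a producer walks the filesystem and queues
    // parquet files while a consumer starts "scanning" immediately,
    // instead of blocking the query on a full metadata read up front.
    public class PipelinedProbeSketch {
      private static final Path EOF = Paths.get("");  // end-of-stream marker

      public static void main(String[] args) throws Exception {
        final BlockingQueue<Path> assignments = new LinkedBlockingQueue<>();
        final Path root = Paths.get(args[0]);

        Thread prober = new Thread(() -> {
          try {
            Files.walk(root)
                .filter(p -> p.toString().endsWith(".parquet"))
                .forEach(assignments::add);
            assignments.add(EOF);
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        });
        prober.start();

        // Consumer: creates a reader per file as assignments arrive.
        for (Path f = assignments.take(); !f.equals(EOF); f = assignments.take()) {
          System.out.println("scanning " + f);  // stand-in for reader work
        }
      }
    }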
> > > > On Mon, Oct 26, 2015 at 2:42 PM, Jacques Nadeau <jacques@dremio.com>
> > > > wrote:
> > > >
> > > > > My first thought is we've gotten too generous in what we're
> > > > > storing in the Parquet metadata file. Early implementations were
> > > > > very lean and it seems far larger today. For example, early
> > > > > implementations didn't keep statistics and ignored row groups
> > > > > (files, schema and block locations only). If we need multiple
> > > > > levels of information, we may want to stagger (or normalize)
> > > > > them in the file. Also, we may think about what is the minimum
> > > > > that must be done in planning. We could do the file pruning at
> > > > > execution time rather than single-tracking these things (makes
> > > > > stats harder though).
> > > > >
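To picture the lean shape described above (files, schema and block
locations only), a hypothetical Jackson-annotated entry; this is a
sketch under assumptions, not Drill's actual Metadata classes.

    import java.util.List;
    import com.fasterxml.jackson.annotation.JsonProperty;

    // Hypothetical sketch of a lean per-file metadata entry: no
    // row-group statistics, just what planning needs for assignment.
    public class LeanFileMetadata {
      @JsonProperty("path")
      public String path;

      @JsonProperty("schemaIndex")   // index into a normalized schema table
      public int schemaIndex;

      @JsonProperty("blockHosts")    // block locations for work assignment
      public List<String> blockHosts;
    }

Staggering (or normalizing) the levels would mean richer statistics live
in a separate section or file that planning can skip when it only needs
pruning inputs.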
> > > > > I also think we should be cautious about jumping to a
> > > > > conclusion until DRILL-3973 provides more insight.
> > > > >
> > > > > In terms of caching, I'd be more inclined to rely on file
> > > > > system caching and make sure serialization/deserialization is as
> > > > > efficient as possible, as opposed to implementing an
> > > > > application-level cache. (We already have enough problems
> > > > > managing memory without having to figure out when we should drop
> > > > > a metadata cache :D).
> > > > >
> > > > > Aside, I always liked this post for the entertainment and the
> > > > > thoughts on virtual memory:
> > > > > https://www.varnish-cache.org/trac/wiki/ArchitectNotes
> > > > >
> > > > >
> > > > > --
> > > > > Jacques Nadeau
> > > > > CTO and Co-Founder, Dremio
> > > > >
> > > > > On Mon, Oct 26, 2015 at 2:25 PM, Hanifi Gunes <hgunes@maprtech.com>
> > > > > wrote:
> > > > >
> > > > > > One more thing: for workloads running queries over subsets of
> > > > > > the same parquet files, we can consider maintaining an
> > > > > > in-memory cache as well, assuming the metadata memory
> > > > > > footprint per file is low and the parquet files are static, so
> > > > > > we would not need to invalidate the cache often.
> > > > > >
> > > > > > H+
> > > > > >
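A minimal sketch of such an in-memory cache using Guava, which Drill
already depends on; FileMetadata here is a hypothetical stand-in for the
per-file metadata type.

    import java.util.concurrent.TimeUnit;
    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    // Hypothetical sketch: per-file metadata cache. Since parquet files
    // are assumed static, entries rarely need invalidation; the size
    // bound and expiry guard the memory footprint instead.
    public class MetadataCacheSketch {
      static class FileMetadata { /* hypothetical per-file metadata */ }

      private final Cache<String, FileMetadata> cache = CacheBuilder.newBuilder()
          .maximumSize(100_000)                  // cap memory footprint
          .expireAfterAccess(1, TimeUnit.HOURS)  // drop cold entries
          .build();

      public FileMetadata get(String path) {
        return cache.getIfPresent(path);
      }

      public void put(String path, FileMetadata metadata) {
        cache.put(path, metadata);
      }
    }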
> > > > > > On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes <hgunes@maprtech.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I am not familiar with the contents of the stored metadata,
> > > > > > > but if the deserialization workload fits any of
> > > > > > > afterburner's claimed improvement points [1], it could well
> > > > > > > be worth trying, given that the claimed gain in throughput
> > > > > > > is substantial.
> > > > > > >
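For reference, wiring Afterburner in is a one-line module registration
on the ObjectMapper; a minimal sketch below, where the MetadataCache
target type and the input file are hypothetical placeholders.

    import java.io.File;
    import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

    public class AfterburnerSketch {
      // Hypothetical stand-in for the deserialized cache file shape.
      @JsonIgnoreProperties(ignoreUnknown = true)
      static class MetadataCache { }

      public static void main(String[] args) throws Exception {
        // Afterburner swaps reflection-based (de)serialization for
        // generated bytecode; registration is the only change needed.
        ObjectMapper mapper = new ObjectMapper();
        mapper.registerModule(new AfterburnerModule());

        MetadataCache cache = mapper.readValue(
            new File(args[0]), MetadataCache.class);
      }
    }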
> > > > > > > It could also be a good idea to partition the cache over a
> > > > > > > number of files for better parallelization, given that the
> > > > > > > number of cache files generated is *significantly* less than
> > > > > > > the number of parquet files. Maintaining global statistics
> > > > > > > seems like an improvement point too.
> > > > > > >
> > > > > > >
> > > > > > > -H+
> > > > > > >
> > > > > > > 1:
> > > > > > > https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
> > > > > > >
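One way to picture the partitioning idea above: hash each parquet file
path to one of N cache shards, so refresh and read can work per shard.
A minimal sketch with hypothetical names and shard-file naming.

    // Hypothetical sketch: shard the metadata cache into N files by
    // hashing each parquet file path.
    public class CacheShardingSketch {
      private static final int NUM_SHARDS = 16;

      public static int shardFor(String parquetPath) {
        // floorMod keeps the shard index non-negative for any hash value.
        return Math.floorMod(parquetPath.hashCode(), NUM_SHARDS);
      }

      public static String shardFileName(int shard) {
        // Hypothetical naming scheme for the shard files.
        return String.format("metadata_cache.%02d.json", shard);
      }
    }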
> > > > > > > On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <amansinha@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Forgot to include the link for Jackson's AfterBurner module:
> > > > > > >>   https://github.com/FasterXML/jackson-module-afterburner
> > > > > > >>
> > > > > > >> On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <amansinha@apache.org>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > I was going to file an enhancement JIRA but thought I
> > > > > > >> > would discuss here first:
> > > > > > >> >
> > > > > > >> > The parquet metadata cache file is a JSON file that
> > > > > > >> > contains a subset of the metadata extracted from the
> > > > > > >> > parquet files. The cache file can get really large .. a
> > > > > > >> > few GBs for a few hundred thousand files.
> > > > > > >> > I have filed a separate JIRA, DRILL-3973, for profiling
> > > > > > >> > the various aspects of planning, including metadata
> > > > > > >> > operations. In the meantime, the timestamps in the
> > > > > > >> > drillbit.log output indicate a large chunk of time spent
> > > > > > >> > in creating the drill table to begin with, which points to
> > > > > > >> > a bottleneck in reading the metadata. (I can provide
> > > > > > >> > performance numbers later once we confirm through
> > > > > > >> > profiling.)
> > > > > > >> >
> > > > > > >> > A few thoughts around improvements:
> > > > > > >> >  - The jackson deserialization of the JSON file is very
> > > > > > >> > slow .. can this be sped up? For instance, the AfterBurner
> > > > > > >> > module of jackson claims to improve performance by 30-40%
> > > > > > >> > by avoiding the use of reflection.
> > > > > > >> >  - The cache file read is a single-threaded process. If
> > > > > > >> > we were directly reading from parquet files, we would use
> > > > > > >> > a default of 16 threads. What can be done to parallelize
> > > > > > >> > the read? (A sketch follows at the end of this message.)
> > > > > > >> >  - Is there any operation that can be done one time
> > > > > > >> > during the REFRESH METADATA command? For instance,
> > > > > > >> > examining the min/max values to determine single-value for
> > > > > > >> > a partition column could be eliminated if we did this
> > > > > > >> > computation during the REFRESH METADATA command and stored
> > > > > > >> > the summary one time.
> > > > > > >> >
> > > > > > >> >  - A pertinent question is: should the cache file be
> > > > > > >> > stored in a more efficient format such as Parquet instead
> > > > > > >> > of JSON?
> > > > > > >> >
> > > > > > >> > Aman
> > > > > > >> >
> > > > > > >> >
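On the parallel-read point in the list above, a rough sketch of
deserializing several cache files concurrently; it assumes the cache has
already been partitioned into multiple files, as suggested earlier in
the thread, and all names are hypothetical.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
    import com.fasterxml.jackson.databind.ObjectMapper;

    // Hypothetical sketch: read N metadata cache files with a thread
    // pool, mirroring the 16-way default used for direct parquet reads.
    public class ParallelCacheReadSketch {
      @JsonIgnoreProperties(ignoreUnknown = true)
      static class MetadataCache { /* hypothetical cache file shape */ }

      // ObjectMapper is thread-safe once configured.
      private static final ObjectMapper MAPPER = new ObjectMapper();

      public static List<MetadataCache> readAll(List<File> cacheFiles)
          throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(16);
        try {
          List<Future<MetadataCache>> futures = new ArrayList<>();
          for (final File f : cacheFiles) {
            futures.add(pool.submit(
                () -> MAPPER.readValue(f, MetadataCache.class)));
          }
          List<MetadataCache> results = new ArrayList<>();
          for (Future<MetadataCache> future : futures) {
            results.add(future.get());  // propagate any parse failure
          }
          return results;
        } finally {
          pool.shutdown();
        }
      }
    }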
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
