drill-dev mailing list archives

From Hanifi Gunes <hgu...@maprtech.com>
Subject Re: [DISCUSS] Ideas to improve metadata cache read performance
Date Mon, 26 Oct 2015 21:25:03 GMT
One more thing: for workloads that repeatedly run queries over subsets of
the same parquet files, we could consider maintaining an in-memory cache as
well. This assumes the metadata memory footprint per file is low and that
the parquet files are static, so we would rarely need to invalidate the
cache.
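
A minimal sketch of the idea (ParquetFileMetadata and readFooter below are
placeholder names, not existing Drill APIs):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch only: keep parsed per-file metadata in memory, keyed by file path.
public class InMemoryMetadataCache {
  private final ConcurrentMap<String, ParquetFileMetadata> cache =
      new ConcurrentHashMap<>();

  public ParquetFileMetadata get(String path) {
    // Parse the footer only on first access; since the parquet files are
    // static, entries never need to be invalidated.
    return cache.computeIfAbsent(path, this::readFooter);
  }

  private ParquetFileMetadata readFooter(String path) {
    // placeholder for the actual footer/metadata read
    return new ParquetFileMetadata();
  }

  // stub standing in for whatever per-file metadata we decide to keep
  static class ParquetFileMetadata { }
}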

H+

On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes <hgunes@maprtech.com> wrote:

> I am not familiar with the contents of the stored metadata, but if the
> deserialization workload fits any of Afterburner's claimed improvement
> points [1], it could well be worth trying, given that the claimed
> throughput gain is substantial.
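>
> If we go this route, wiring it in is a one-liner on the ObjectMapper we
> already use, e.g.:
>
> import com.fasterxml.jackson.databind.ObjectMapper;
> import com.fasterxml.jackson.module.afterburner.AfterburnerModule;
>
> ObjectMapper mapper = new ObjectMapper();
> // Afterburner replaces Jackson's reflection-based property access with
> // generated bytecode; the (de)serialization code otherwise stays the same.
> mapper.registerModule(new AfterburnerModule());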
>
> It could also be a good idea to partition the cache over a number of
> files for better parallelism, given that the number of cache files
> generated would be *significantly* smaller than the number of parquet
> files. Maintaining global statistics seems like an improvement point too.
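>
> To make the sharding concrete, a sketch (the shard count and the shard
> file naming are made up):
>
> import java.nio.file.Path;
>
> // Hash each parquet file path into one of numShards cache files so that
> // REFRESH METADATA can write, and planning can read, shards in parallel.
> static Path shardFor(String filePath, Path cacheDir, int numShards) {
>   // floorMod keeps the index non-negative for any hashCode value
>   int shard = Math.floorMod(filePath.hashCode(), numShards);
>   return cacheDir.resolve(".drill.parquet_metadata_" + shard);
> }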
>
>
> -H+
>
> 1:
> https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
>
> On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <amansinha@apache.org> wrote:
>
>> Forgot to include the link for Jackson's AfterBurner module:
>>   https://github.com/FasterXML/jackson-module-afterburner
>>
>> On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <amansinha@apache.org> wrote:
>>
>> > I was going to file an enhancement JIRA but thought I would discuss
>> > here first:
>> >
>> > The parquet metadata cache file is a JSON file that contains a subset
>> > of the metadata extracted from the parquet files.  The cache file can
>> > get really large... a few GBs for a few hundred thousand files.
>> > I have filed a separate JIRA: DRILL-3973 for profiling the various
>> > aspects of planning, including metadata operations.  In the meantime,
>> > the timestamps in the drillbit.log output indicate a large chunk of
>> > time spent in creating the drill table to begin with, which indicates
>> > a bottleneck in reading the metadata.  (I can provide performance
>> > numbers later once we confirm through profiling.)
>> >
>> > A few thoughts around improvements:
>> >  - The jackson deserialization of the JSON file is very slow... can
>> > this be sped up?  For instance, the AfterBurner module of jackson
>> > claims to improve performance by 30-40% by avoiding the use of
>> > reflection.
>> >  - The cache file read is a single-threaded process.  If we were
>> > reading directly from parquet files, we would use a default of 16
>> > threads.  What can be done to parallelize the read?  (see the first
>> > sketch after this list)
>> >  - Any operation that can be done one time during the REFRESH METADATA
>> > command?  For instance, examining the min/max values to determine
>> > single-value for a partition column could be eliminated if we did this
>> > computation during the REFRESH METADATA command and stored the summary
>> > one time.  (see the second sketch after this list)
>> >
>> >  - A pertinent question is: should the cache file be stored in a more
>> > efficient format such as Parquet instead of JSON?
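>> >
>> > A rough sketch of reading sharded cache files in parallel (the
>> > MetadataShard type and deserializeShard() are placeholders, assuming
>> > the cache has been split into multiple files):
>> >
>> > import java.nio.file.Path;
>> > import java.util.ArrayList;
>> > import java.util.List;
>> > import java.util.concurrent.ExecutionException;
>> > import java.util.concurrent.ExecutorService;
>> > import java.util.concurrent.Executors;
>> > import java.util.concurrent.Future;
>> >
>> > static List<MetadataShard> readShards(List<Path> shardFiles)
>> >     throws InterruptedException, ExecutionException {
>> >   // mirror the 16-thread default used when reading parquet footers
>> >   ExecutorService pool = Executors.newFixedThreadPool(16);
>> >   try {
>> >     List<Future<MetadataShard>> futures = new ArrayList<>();
>> >     for (Path shard : shardFiles) {
>> >       // each shard is deserialized independently on its own thread
>> >       futures.add(pool.submit(() -> deserializeShard(shard)));
>> >     }
>> >     List<MetadataShard> result = new ArrayList<>();
>> >     for (Future<MetadataShard> f : futures) {
>> >       result.add(f.get());  // collect results, propagating any failure
>> >     }
>> >     return result;
>> >   } finally {
>> >     pool.shutdown();
>> >   }
>> > }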
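>> >
>> > And the one-time single-value check during REFRESH METADATA could look
>> > roughly like this (ColumnChunkMetadata is a placeholder):
>> >
>> > // A partition column is single-valued within a file when min == max
>> > // and there are no nulls; storing this boolean once in the cache
>> > // avoids re-examining min/max on every planning pass.
>> > static boolean isSingleValued(ColumnChunkMetadata col) {
>> >   return col.nullCount() == 0
>> >       && col.min() != null
>> >       && col.min().equals(col.max());
>> > }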
>> >
>> > Aman
>> >
>> >
>>
>
>
