drill-dev mailing list archives

From parthchandra <...@git.apache.org>
Subject [GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance
Date Wed, 12 Oct 2016 20:46:32 GMT
GitHub user parthchandra opened a pull request:

    https://github.com/apache/drill/pull/611

    Drill-4800: Improve parquet reader performance

    Added a Buffering input stream
    Updated parquet reader to optionally use the buffering input stream
    Added optional asynchronous reading of page data
    Added optional parallel decompression and decoding of columns
        Decompression of data using Gzip/Snappy bypasses the Parquet APIs and calls the
        decompressors directly (there were concurrency issues with using the Parquet APIs)
    Added new operator metrics for asynchronous page reading.
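
    The asynchronous page reading listed above can be pictured roughly as follows.
    This is only a minimal sketch, not the code in this pull request: the class
    AsyncPagePrefetchSketch, its methods, and the fixed page size are hypothetical
    names chosen for illustration, and a real Parquet page has a header and a
    variable length. The idea shown is that the read of the next page is queued on
    an I/O thread before the current page is processed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; this is not Drill's AsyncPageReader.
final class AsyncPagePrefetchSketch {

    // Reads up to pageSize bytes. Returns a shorter (possibly empty) array
    // once the end of the stream has been reached.
    static byte[] readPage(InputStream in, int pageSize) throws IOException {
        byte[] buf = new byte[pageSize];
        int filled = 0;
        while (filled < pageSize) {
            int n = in.read(buf, filled, pageSize - filled);
            if (n < 0) {
                break; // end of stream
            }
            filled += n;
        }
        return Arrays.copyOf(buf, filled);
    }

    // Reads every page, always keeping one read in flight on the I/O thread
    // while the caller handles the page it already has.
    static List<byte[]> readAllPages(InputStream in, int pageSize)
            throws InterruptedException, ExecutionException {
        ExecutorService io = Executors.newSingleThreadExecutor();
        List<byte[]> pages = new ArrayList<>();
        try {
            Callable<byte[]> readTask = () -> readPage(in, pageSize);
            Future<byte[]> pending = io.submit(readTask);
            while (true) {
                byte[] page = pending.get();   // wait for the page read queued earlier
                if (page.length == 0) {
                    break;                     // nothing left to read
                }
                pending = io.submit(readTask); // queue the next read first
                pages.add(page);               // stand-in for decompress/decode work
            }
        } finally {
            io.shutdown();
        }
        return pages;
    }
}
```

    A single-threaded executor keeps the reads on one stream sequential, so only
    the decode step overlaps with I/O.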
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/parthchandra/drill DRILL-4800

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/611.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #611
    
----
commit 0457d69cae403bc8abcebb90ead55769ec58f5ac
Author: Parth Chandra <parthc@apache.org>
Date:   2016-06-10T21:56:41Z

    DRILL-4800: Use a buffering input stream in the Parquet reader

commit a33200107a5180f1b0dbad2b2e5b0905de4ed884
Author: Parth Chandra <parthc@apache.org>
Date:   2016-08-24T17:46:37Z

    DRILL-4800: Parallelize column reading.
      Read/Decode fixed width fields in parallel
      Decoding var length columns in parallel
      Use simplified decompress method for Gzip and Snappy decompression. Avoids a concurrency
      issue with Parquet decompression. (It's also faster.)
      Stress test Parquet read write
      Parallel column reader is disabled by default (may perform less well under higher concurrency)
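
    As a rough picture of the parallel column decompression described in this
    commit, the sketch below decompresses several Gzip-compressed buffers on a
    thread pool, with each task creating its own decompressor so that no
    decompression state is shared between threads. It is a sketch under
    assumptions, not the code from the commit: ParallelColumnDecompressSketch and
    its methods are hypothetical names, and it uses the JDK's java.util.zip
    classes rather than whichever decompressors the patch actually calls.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not the reader code from this commit.
final class ParallelColumnDecompressSketch {

    // Helper used to build test input: Gzip-compresses a buffer.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    // Decompresses one buffer with a decompressor private to this call.
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    // Decompresses every column on a fixed-size pool. Results come back in
    // the same order as the input list.
    static List<byte[]> decompressColumns(List<byte[]> compressedColumns, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<byte[]>> tasks = new ArrayList<>();
            for (byte[] column : compressedColumns) {
                tasks.add(() -> gunzip(column));
            }
            List<byte[]> result = new ArrayList<>();
            for (Future<byte[]> done : pool.invokeAll(tasks)) {
                result.add(done.get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```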

commit 8d9c26071b4826bda917ac4e88c70b7351a16d83
Author: Parth Chandra <parthc@apache.org>
Date:   2016-09-27T21:03:35Z

    DRILL-4800: Add AsyncPageReader to pipeline PageRead
      Use non tracking input stream for Parquet scans.
      Make choice between async and sync reader configurable.
      Make various options user configurable - choose between sync and async page reader,
      enable/disable fadvise
      Add Parquet Scan metrics to track time spent in various operations

commit 91658f0cb3bb2ee3ff35a0ffde859052df91527e
Author: Parth Chandra <parthc@apache.org>
Date:   2016-09-14T04:47:49Z

    DRILL-4800: Various fixes.
     Fix buffer underflow exception in BufferedDirectBufInputStream.
     Fix writer index for int64 dictionary encoded types.
     Added logging to help debug.
     Fix memory leaks.
     Work around issues with InputStream.available() (do not use hasRemainder; remove the
     check for EOF in BufferedDirectBufInputStream.read()).
     Finalize defaults.
     Remove commented code.
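
    The commit message does not spell out what went wrong with
    InputStream.available(). One plausible reading, offered here as an assumption,
    is that available() is only an estimate and may return 0 while data remains,
    so it is unsafe as an end-of-stream test. The sketch below uses hypothetical
    names (ReadUntilEofSketch, ZeroAvailableStream) and is not taken from
    BufferedDirectBufInputStream. It contrasts a loop driven by available() with
    one driven by the -1 return value of read().

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; not code from BufferedDirectBufInputStream.
final class ReadUntilEofSketch {

    // Wraps a stream and always reports 0 available bytes, which the
    // InputStream contract allows even when more data can be read.
    static final class ZeroAvailableStream extends FilterInputStream {
        ZeroAvailableStream(InputStream in) {
            super(in);
        }

        @Override
        public int available() {
            return 0;
        }
    }

    // Fragile: treats available() == 0 as end of stream.
    static byte[] drainUsingAvailable(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (in.available() > 0) {
            int b = in.read();
            if (b < 0) {
                break;
            }
            out.write(b);
        }
        return out.toByteArray();
    }

    // Robust: keeps reading until read() itself reports end of stream (-1).
    static byte[] drainUntilEof(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

    On a ZeroAvailableStream the first loop returns no bytes at all, while the
    second returns the full contents.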

----


