drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5207) Improve Parquet scan pipelining
Date Sat, 28 Jan 2017 00:44:25 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843712#comment-15843712 ]

ASF GitHub Bot commented on DRILL-5207:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/723#discussion_r98268204
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java ---
    @@ -41,26 +42,33 @@
     import java.io.IOException;
     import java.nio.ByteBuffer;
     import java.util.concurrent.Callable;
    +import java.util.concurrent.ConcurrentLinkedQueue;
     import java.util.concurrent.ExecutorService;
     import java.util.concurrent.Future;
    +import java.util.concurrent.LinkedBlockingQueue;
     import java.util.concurrent.TimeUnit;
     
     import static org.apache.parquet.column.Encoding.valueOf;
     
     class AsyncPageReader extends PageReader {
       static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AsyncPageReader.class);
     
    -
       private ExecutorService threadPool;
    -  private Future<ReadStatus> asyncPageRead;
    +  private long queueSize;
    +  private LinkedBlockingQueue<ReadStatus> pageQueue;
    +  private ConcurrentLinkedQueue<Future<Boolean>> asyncPageRead;
    --- End diff --
    
    I originally returned false on failure, but then changed the code to throw exceptions and left the boolean return value in place. Will change that.
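
    The change described in the comment above might look roughly like the sketch below. This is not the code from the pull request: VoidTaskSketch, readTask and runOnce are hypothetical names used only for illustration. It assumes the task signals failure solely by throwing, so the Future carries no value (Future<Void>) and the caller sees the failure as the cause of the ExecutionException raised by Future.get().

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class VoidTaskSketch {

  // Hypothetical stand-in for a page-read task; not Drill's actual class.
  static Callable<Void> readTask(final boolean fail) {
    return new Callable<Void>() {
      @Override
      public Void call() throws IOException {
        if (fail) {
          // Failure is reported by throwing, not by returning false.
          throw new IOException("simulated read failure");
        }
        return null; // success carries no value
      }
    };
  }

  // Runs one task and reports how it ended: "ok" or the failure message.
  static String runOnce(boolean fail) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      Future<Void> result = pool.submit(readTask(fail));
      result.get(); // a task failure surfaces here, wrapped in ExecutionException
      return "ok";
    } catch (ExecutionException e) {
      return e.getCause().getMessage();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for callers
      return "interrupted";
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(runOnce(false));
    System.out.println(runOnce(true));
  }
}
```

    With this shape there is no boolean result for the caller to check, so a failure cannot be silently ignored by forgetting to test the return value.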


> Improve Parquet scan pipelining
> -------------------------------
>
>                 Key: DRILL-5207
>                 URL: https://issues.apache.org/jira/browse/DRILL-5207
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>             Fix For: 1.10
>
>
> The Parquet reader's async page reader is not pipelined efficiently.
> The default size of the disk read buffer is 4MB, while the page reader reads ~1MB at a time. The Parquet decode also processes 1MB at a time. This means the disk is idle while the data is being processed. Reducing the buffer to 1MB will reduce the time the processing thread waits for the disk read thread.
> Additionally, since the data needed to process a page may be more or less than 1MB, a queue of pages will help so that the disk scan does not block (until the queue is full) waiting for the processing thread.
> Additionally, the BufferedDirectBufInputStream class reads from disk as soon as it is initialized. Since this is called at setup time, it increases the setup time for the query, and query execution does not begin until the read completes.
> There are a few other inefficiencies: for example, options are read every time a page reader is created, and reading options can be expensive.
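
The page queue proposed in the description can be illustrated with a minimal producer/consumer sketch. This is not Drill's implementation: PageQueueSketch, Page, END and scan are hypothetical names, and the disk read is simulated by allocating byte arrays. The only point it shows is that a bounded LinkedBlockingQueue lets the reading thread run ahead of the processing thread and block only once the queue is full.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

public class PageQueueSketch {

  // Hypothetical page holder standing in for a page read from disk.
  static final class Page {
    final byte[] data;
    Page(byte[] data) { this.data = data; }
  }

  // Sentinel object (compared by identity) that marks the end of input.
  static final Page END = new Page(new byte[0]);

  // Reads pageCount simulated pages on a background thread and
  // returns the total number of bytes the consumer processed.
  static long scan(final int pageCount, final int pageSize, int queueCapacity) {
    final LinkedBlockingQueue<Page> pageQueue =
        new LinkedBlockingQueue<Page>(queueCapacity);
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      Future<?> reader = pool.submit(new Runnable() {
        @Override
        public void run() {
          try {
            for (int i = 0; i < pageCount; i++) {
              // put() blocks only while the queue is full.
              pageQueue.put(new Page(new byte[pageSize]));
            }
            pageQueue.put(END);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      });
      long total = 0;
      while (true) {
        Page page = pageQueue.take(); // blocks only while the queue is empty
        if (page == END) {
          break;
        }
        total += page.data.length; // stand-in for decoding the page
      }
      reader.get(); // surface any failure from the reading thread
      return total;
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println(scan(5, 1024, 2));
  }
}
```

The queue capacity bounds how much memory the read-ahead can hold; a capacity of 1 degenerates to reading one page ahead at a time.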



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
