drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5266) Parquet Reader produces "low density" record batches - bits vs. bytes
Date Thu, 23 Feb 2017 14:46:44 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880575#comment-15880575 ]

ASF GitHub Bot commented on DRILL-5266:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/749#discussion_r102725896
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java ---
    @@ -376,14 +378,14 @@ public void setup(OperatorContext operatorContext, OutputMutator output) throws
           if (dataTypeLength == -1) {
               allFieldsFixedLength = false;
             } else {
    -        bitWidthAllFixedFields += dataTypeLength;
    +          bitWidthAllFixedFields += dataTypeLength;
             }
           }
     //    rowGroupOffset = footer.getBlocks().get(rowGroupIndex).getColumns().get(0).getFirstDataPageOffset();
     
         if (columnsToScan != 0  && allFieldsFixedLength) {
           recordsPerBatch = (int) Math.min(Math.min(batchSize / bitWidthAllFixedFields,
    -          footer.getBlocks().get(0).getColumns().get(0).getValueCount()), 65535);
    +          footer.getBlocks().get(0).getColumns().get(0).getValueCount()), DEFAULT_RECORDS_TO_READ_IF_VARIABLE_WIDTH);
    --- End diff --
    
    This is DEFAULT_RECORDS_TO_READ_FIXED_WIDTH (not VARIABLE)
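
    For readers following the diff, the nested Math.min expression can be sketched in isolation as below. This is only an illustration, not the code from the pull request: the class and method names are hypothetical, the constant name is taken from the comment above, and its value is assumed to match the literal 65535 that the diff replaces. The message does not say which units batchSize uses, so the sketch only shows that dividing a byte count by a bit width yields a batch eight times smaller than dividing a bit count by the same width.

```java
// Hypothetical sketch of the records-per-batch calculation for fixed-width columns.
final class RecordsPerBatchSketch {

    // Assumption: same value as the literal 65535 that the diff replaces.
    static final int DEFAULT_RECORDS_TO_READ_FIXED_WIDTH = 65535;

    // batchSize and widthAllFixedFields must be expressed in the same unit
    // (both bits or both bytes); otherwise the result is off by a factor of 8.
    static int recordsPerBatch(long batchSize, int widthAllFixedFields, long valueCount) {
        return (int) Math.min(
            Math.min(batchSize / widthAllFixedFields, valueCount),
            DEFAULT_RECORDS_TO_READ_FIXED_WIDTH);
    }

    public static void main(String[] args) {
        int widthBits = 3 * 32;               // three 4-byte columns, in bits (made-up schema)
        long batchBytes = 256L * 1024;        // made-up batch budget, in bytes
        long batchBits = batchBytes * 8;      // the same budget, in bits
        long valueCount = 1_000_000L;         // made-up row group size

        // Consistent units: bits divided by bits.
        System.out.println(recordsPerBatch(batchBits, widthBits, valueCount));
        // Mismatched units: bytes divided by bits gives a much smaller batch.
        System.out.println(recordsPerBatch(batchBytes, widthBits, valueCount));
    }
}
```

    With these made-up numbers the mismatched call fills only about one eighth of the rows that the consistent call does, which is the kind of under-filled batch the issue title describes.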


> Parquet Reader produces "low density" record batches - bits vs. bytes
> ---------------------------------------------------------------------
>
>                 Key: DRILL-5266
>                 URL: https://issues.apache.org/jira/browse/DRILL-5266
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.10
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>
> Testing with the managed sort revealed that, for at least one file, Parquet produces "low-density" batches: batches in which only 5% of each value vector contains actual data, with the rest being unused space. When fed into the sort, we end up buffering 95% of wasted space, using only 5% of available memory to hold actual query data. The result is poor performance of the sort as it must spill far more frequently than expected.
> The managed sort analyzes incoming batches to prepare good memory use estimates. The following is the output from the Parquet file in question:
> {code}
> Actual batch schema & sizes {
>   T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> ...
>   c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
>   Records: 1129, Total size: 32006144, Row width:28350, Density:5}
> {code}
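
The figures in the output above can be cross-checked with a little arithmetic. The sketch below is an illustration only: the class and method names are hypothetical, and it computes a plain ratio of data size to vector size, which may not be the exact formula Drill uses for its reported density value. The input numbers are copied from the log.

```java
// Hypothetical sketch: how full is a value vector, given the sizes reported in the log?
final class BatchDensitySketch {

    // Fraction of the allocated vector that holds actual data.
    static double fillRatio(long dataSizeBytes, long vectorSizeBytes) {
        return (double) dataSizeBytes / vectorSizeBytes;
    }

    public static void main(String[] args) {
        int records = 1129;        // "Records" from the log
        int rowCapacity = 32768;   // "row capacity" of the 4-byte columns
        int colWidth = 4;          // "actual col. size" in bytes

        long dataSize = (long) records * colWidth;        // 4516, matches "data size"
        long vectorSize = (long) rowCapacity * colWidth;  // 131072, matches "vector size"

        System.out.printf("4-byte column fill: %.1f%%%n",
            100 * fillRatio(dataSize, vectorSize));
        // c_email_address, using the sizes reported in the log directly.
        System.out.printf("c_email_address fill: %.1f%%%n",
            100 * fillRatio(30327, 49152));
    }
}
```

The 4-byte columns hold 1129 rows in vectors sized for 32768, so only a few percent of each vector is used, consistent with the single-digit densities in the log.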



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
