drill-issues mailing list archives

From "Kunal Khatua (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5267) Managed external sort spills too often with Parquet data
Date Tue, 21 Mar 2017 00:25:41 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933846#comment-15933846 ]

Kunal Khatua commented on DRILL-5267:
-------------------------------------

[~paul-rogers] Does [~rkins] need to define tests specifically for this issue? How do we
verify that it is fixed? The fix appears to come from the PR for DRILL-5266.

> Managed external sort spills too often with Parquet data
> --------------------------------------------------------
>
>                 Key: DRILL-5267
>                 URL: https://issues.apache.org/jira/browse/DRILL-5267
>             Project: Apache Drill
>          Issue Type: Sub-task
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.10.0
>
>
> DRILL-5266 describes how Parquet produces low-density record batches. The result of these
> batches is that the external sort spills more frequently than it should because it sizes spill
> files based on batch size, not data content of the batch. Since Parquet batches are 95% empty
> space, the spill files end up far too small.
> Adjust the spill calculations based on actual data content, not the size of the overall
> record batch.
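
The arithmetic behind the description above can be sketched as follows. This is
only an illustration, not Drill's actual spill logic: the function name
`spill_file_bytes`, its parameters, and all of the sizes are hypothetical. It
assumes the sort picks how many batches go into one spill file by dividing a
target file size by a per-batch size estimate, and that only the real data in
each batch reaches disk.

```python
MIB = 1024 * 1024


def spill_file_bytes(target_spill_bytes, batch_alloc_bytes, batch_data_bytes,
                     size_by_data):
    """Return the on-disk size of one spill file (hypothetical model).

    batch_alloc_bytes: memory allocated for a batch, including empty space.
    batch_data_bytes:  bytes of real data in the batch.
    size_by_data:      if True, estimate per-batch size from the data content;
                       if False, estimate it from the allocated batch size.
    """
    per_batch_estimate = batch_data_bytes if size_by_data else batch_alloc_bytes
    # Number of batches assumed to fill one spill file of the target size.
    batches = max(1, target_spill_bytes // per_batch_estimate)
    # Only the real data is written, so the file holds batches * data bytes.
    return batches * batch_data_bytes


# Hypothetical numbers: 20 MiB batches that are 95% empty (1 MiB of data),
# with a 200 MiB target spill file.
target = 200 * MIB
alloc = 20 * MIB
data = 1 * MIB

by_batch_size = spill_file_bytes(target, alloc, data, size_by_data=False)
by_data_size = spill_file_bytes(target, alloc, data, size_by_data=True)

print(by_batch_size // MIB, "MiB per spill file when sized by batch size")
print(by_data_size // MIB, "MiB per spill file when sized by data content")
```

Under these assumed numbers, sizing by allocated batch size produces spill files
at 5% of the target, so the sort writes about twenty times as many files as it
would when sizing by data content.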



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
