drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
Date Fri, 01 Sep 2017 00:18:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149825#comment-16149825 ]

Paul Rogers commented on DRILL-5758:
------------------------------------

It turns out that {{RecordBatchSizer}} contained a bug in its handling of repeated elements. Consider the original output:

{code}
  rms.mapvalue.col2(type: REPEATED BIGINT, count: 1, total entries: 1, per-array: 1, std size: 8, actual size: 52, data size: 52)
...
  Records: 4096, Total size: 1441792, Data size: 376615, Gross row width: 352, Net row width: 92, Density: 27}
{code}

In the above, {{col2}} is a repeated column, but the sizer reports a count of 1 and just 1 entry per array, even though the batch holds 4096 records.

Output after the fix:

{code}
  rms.mapvalue.col2(type: REPEATED BIGINT, count: 4096, elements: 12288, per-array: 3, std size: 8, actual size: 28, data size: 114688)
...
  Records: 4096, Total size: 1441792, Data size: 1136848, Gross row width: 352, Net row width: 278, Density: 79}
{code}

Note that the (average) number of elements per array is now 3, and the estimated "net" row width has grown from 92 to 278.
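
As a cross-check, the summary numbers in both outputs can be reproduced from the raw sizes with simple arithmetic. The sketch below is not the {{RecordBatchSizer}} code; the function names are hypothetical, and the use of ceiling division is an assumption that happens to match every figure shown above.

```python
def round_up(num: int, denom: int) -> int:
    """Integer ceiling division (assumed rounding rule; returns 0 if denom is 0)."""
    return 0 if denom == 0 else -(-num // denom)


def summarize(records: int, total_size: int, data_size: int) -> dict:
    """Derive the per-batch summary figures from the raw byte counts."""
    return {
        "gross_row_width": round_up(total_size, records),
        "net_row_width": round_up(data_size, records),
        "density": round_up(data_size * 100, total_size),  # percent
    }


# Batch totals copied from the two outputs above.
before = summarize(records=4096, total_size=1441792, data_size=376615)
after = summarize(records=4096, total_size=1441792, data_size=1136848)

# Column col2 after the fix: count 4096, elements 12288, data size 114688.
per_array = round_up(12288, 4096)     # average elements per array
actual_size = round_up(114688, 4096)  # average bytes per row for this column

print(before)  # {'gross_row_width': 352, 'net_row_width': 92, 'density': 27}
print(after)   # {'gross_row_width': 352, 'net_row_width': 278, 'density': 79}
print(per_array, actual_size)  # 3 28
```

The total (allocated) size is the same in both runs, so the gross row width does not change; only the data size, and with it the net row width and density, was under-reported before the fix.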

The result is much better vector size estimates and no vector reallocations:

{code}
Initial output batch allocation: 811008 bytes, 3771 records
<Note no vector resizes here.>
Took 4438 us to merge 3771 records, consuming 811008 bytes of memory
{code}
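
To illustrate why the per-array estimate matters for reallocations, here is a rough sketch. It assumes a buffer that doubles its capacity whenever it runs out of room; the helper name and that growth policy are illustrative assumptions, not taken from the Drill source. The record count (3771), element width (8 bytes for BIGINT) and per-array values (1 versus 3) come from the outputs above.

```python
def reallocations(initial_bytes: int, needed_bytes: int) -> int:
    """Count the doublings needed to grow a buffer to at least needed_bytes."""
    if initial_bytes <= 0:
        raise ValueError("initial_bytes must be positive")
    capacity = initial_bytes
    count = 0
    while capacity < needed_bytes:
        capacity *= 2  # assumed growth policy: double on overflow
        count += 1
    return count


records = 3771   # records in the output batch
value_width = 8  # bytes per BIGINT element

needed = records * 3 * value_width          # data actually written, 3 elements per array
under_estimate = records * 1 * value_width  # buggy estimate, 1 element per array

print(reallocations(under_estimate, needed))  # 2
print(reallocations(needed, needed))          # 0
```

With the buggy estimate the buffer would have to grow twice in this model; with the corrected estimate the initial allocation is already large enough.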

And now the sort completes:
{code}
Results: 4,000,000 records, 63 batches
{code}

> Rollup of external sort fixes to issues found by QA
> ---------------------------------------------------
>
>                 Key: DRILL-5758
>                 URL: https://issues.apache.org/jira/browse/DRILL-5758
>             Project: Apache Drill
>          Issue Type: Task
>    Affects Versions: 1.12.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.12.0
>
>
> Tracking JIRA to be used for the PR that combines fixes for various JIRA entries. Bugs fixed
> in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
