drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-5366) Use generic copier for wide rows in external sort
Date Sat, 18 Mar 2017 22:00:43 GMT

     [ https://issues.apache.org/jira/browse/DRILL-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Paul Rogers updated DRILL-5366:
-------------------------------
    Component/s:     (was: Tools, Build & Test)

> Use generic copier for wide rows in external sort
> -------------------------------------------------
>
>                 Key: DRILL-5366
>                 URL: https://issues.apache.org/jira/browse/DRILL-5366
>             Project: Apache Drill
>          Issue Type: Sub-task
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.11.0
>
>
> The external sort makes use of a "priority copier" to copy rows at two times:
> * When merging data during spilling
> * When merging data for an in-memory sort
> As with all such Drill operators, the code works by generating two local variables per
column, then generating two blocks of code per column (one for setup, one for the actual copy.)
> This works fine for rows with few columns. But, in rows with many columns (such as for
queries against JSON documents), the amount of code produced becomes very large. This introduces
extra overhead to generate, compile and store the extra code.
> DRILL-5125 found the same issue in the Selection Vector remover. By applying the fix
from that ticket to the external sort, we reap 24% savings.
> Consider a unit test that runs only the copier. Create a single record batch with 1000
columns and 64K rows. Use the copier to produce a set of smaller output batches. Such a test
factors out all the overhead of running a query.
> * Run time for a generated copier: 17 secs.
> * Run time with the generic copier: 13 secs
> * Savings: 4 seconds or 24%.
> (13 seconds is still a very long time to process 64K rows. There may be optimizations
to be had in the priority queue implementation as well, but that is a separate issue.)
> To be conservative, provide a config option to enable the feature, perhaps by setting
a threshold of the number of columns that must be present to use the generic version. That
way, if folks feel that the generated version is faster for narrow rows, the generated version
can be used. And each user can decide the point at which the costs of bulky code outweighs
the performance costs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message