drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-6071) Limit batch size for flatten operator
Date Mon, 22 Jan 2018 04:32:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333862#comment-16333862 ]

ASF GitHub Bot commented on DRILL-6071:

Github user paul-rogers commented on a diff in the pull request:

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
    @@ -76,6 +76,9 @@ private ExecConstants() {
       public static final String SPILL_FILESYSTEM = "drill.exec.spill.fs";
       public static final String SPILL_DIRS = "drill.exec.spill.directories";
    +  public static final String OUTPUT_BATCH_SIZE = "drill.exec.memory.operator.output_batch_sizeinMB";
    --- End diff ---
    I wonder if MB is too coarse. Maybe just make this the batch size in bytes or K. If this
were a config option, we could use HOCON syntax for MB and so on. Since it is a session/system
option, I suppose it must be a simple number. For example, below, the external sort batch
size is in bytes, isn't it?
    Further, this should be restricted to be a system option. Check with Tim on how to do
that; he and/or Jyothsna implemented a way to do it.
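
A hypothetical HOCON fragment illustrating the unit syntax the reviewer alludes to: Typesafe Config (which Drill uses for boot-time configuration) accepts memory sizes with unit suffixes, so a config option would not need an "inMB" suffix baked into its name. The value below is illustrative only:

```hocon
# Hypothetical boot-time entry, not an actual Drill option as written.
# HOCON size strings ("16M", "16MB", "16384K", plain bytes) are all
# parsed to a byte count by Config.getBytes().
drill.exec.memory.operator.output_batch_size: 16M
```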

> Limit batch size for flatten operator
> -------------------------------------
>                 Key: DRILL-6071
>                 URL: https://issues.apache.org/jira/browse/DRILL-6071
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.12.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>            Priority: Major
>             Fix For: 1.13.0
> flatten currently uses an adaptive algorithm to control the outgoing batch size.
> While processing the input batch, it adjusts the number of records in the outgoing batch
based on memory usage so far. Once memory usage exceeds the configured limit for a batch,
the algorithm becomes more proactive and adjusts the limit halfway through and at the end of
every batch. All this periodic checking of memory usage is unnecessary overhead and impacts
performance. Also, we learn that a batch is too large only after the fact.
> Instead, figure out up front how many rows should go in the outgoing batch from the incoming
> batch. The way to do that would be to compute the average row size of the outgoing batch and,
based on that, determine how many rows fit in a given amount of memory. Value vectors
provide the information necessary to figure this out.
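
The sizing arithmetic the description proposes can be sketched as follows. This is a minimal illustration, not Drill's actual implementation; the class and method names here (`BatchSizer`, `rowsForBudget`) are made up for the example:

```java
// Sketch of up-front batch sizing: derive the outgoing row count from an
// estimated average row size and a fixed memory budget, instead of checking
// memory usage periodically while filling the batch.
public class BatchSizer {
    // Drill value vectors address at most 64K rows per batch.
    static final int MAX_ROWS = 65536;

    /**
     * @param totalOutgoingBytes projected size of all outgoing columns
     *                           for the incoming rows (from value vector metadata)
     * @param incomingRows       row count of the incoming batch
     * @param budgetBytes        configured output batch size limit
     * @return number of rows to emit per outgoing batch
     */
    static int rowsForBudget(long totalOutgoingBytes, int incomingRows, long budgetBytes) {
        if (incomingRows <= 0 || totalOutgoingBytes <= 0) {
            return MAX_ROWS; // nothing to size against; fall back to the hard cap
        }
        long avgRowSize = Math.max(1, totalOutgoingBytes / incomingRows);
        long rows = budgetBytes / avgRowSize;
        return (int) Math.max(1, Math.min(rows, MAX_ROWS));
    }

    public static void main(String[] args) {
        // 1000 incoming rows projecting to 1 MB of output, with a 16 MB budget:
        // avg row size 1000 bytes -> 16777 rows per outgoing batch.
        System.out.println(rowsForBudget(1_000_000, 1000, 16 * 1024 * 1024)); // → 16777
    }
}
```

Because the row count is fixed before the batch is filled, the operator pays the sizing cost once per incoming batch rather than on every memory check.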

This message was sent by Atlassian JIRA
