drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-6071) Limit batch size for flatten operator
Date Fri, 26 Jan 2018 22:05:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341663#comment-16341663 ]

ASF GitHub Bot commented on DRILL-6071:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1091#discussion_r164233329
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/unit/TestOutputBatchSize.java ---
    @@ -0,0 +1,498 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + * <p/>
    + * http://www.apache.org/licenses/LICENSE-2.0
    + * <p/>
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.drill.exec.physical.unit;
    +
    +import com.google.common.collect.Lists;
    +import org.apache.drill.common.expression.SchemaPath;
    +
    +import org.apache.drill.exec.physical.base.AbstractBase;
    +import org.apache.drill.exec.physical.base.PhysicalOperator;
    +import org.apache.drill.exec.physical.config.FlattenPOP;
    +import org.apache.drill.exec.physical.impl.ScanBatch;
    +import org.apache.drill.exec.physical.impl.spill.RecordBatchSizer;
    +import org.apache.drill.exec.record.RecordBatch;
    +import org.apache.drill.exec.record.VectorAccessible;
    +import org.apache.drill.exec.util.JsonStringArrayList;
    +import org.apache.drill.exec.util.JsonStringHashMap;
    +import org.apache.drill.exec.util.Text;
    +import org.junit.Ignore;
    +import org.junit.Test;
    +
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.List;
    +
    +public class TestOutputBatchSize extends PhysicalOpUnitTestBase {
    --- End diff --
    
    I added a test case like this, testFlattenLargeRecords, and there are a number of other test cases as well.
    All of the tests verify the batch sizes and the number of batches.


> Limit batch size for flatten operator
> -------------------------------------
>
>                 Key: DRILL-6071
>                 URL: https://issues.apache.org/jira/browse/DRILL-6071
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.12.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>            Priority: Major
>             Fix For: 1.13.0
>
>
> flatten currently uses an adaptive algorithm to control the outgoing batch size.
> While processing the input batch, it adjusts the number of records in the outgoing batch based on the memory used so far. Once memory usage exceeds the configured limit for a batch, the algorithm becomes more proactive and adjusts the limit halfway through and at the end of every batch. All this periodic checking of memory usage is unnecessary overhead and hurts performance. Also, we learn that the limit was exceeded only after the fact.
> Instead, figure out from the incoming batch how many rows the outgoing batch should contain.
> The way to do that is to compute the average row size of the outgoing batch and, based on that, how many rows fit in a given amount of memory. Value vectors provide the information needed to figure this out.
> The row count in the output batch should be decided based on memory (with a minimum of 1 and a maximum of 64K rows) and not hard-coded (to 4K) in the code. The memory for an output batch should be a configurable system option.
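
The row-count rule described in the issue can be illustrated with a minimal sketch. This is not the code from the pull request: the class name, method name, and parameters below are hypothetical, and the sketch assumes that "64K" means 65536 rows and that the average outgoing row size has already been estimated (for example, from value vector sizes).

```java
// Hypothetical sketch of the sizing rule in the issue description.
// None of these names come from the Drill code base.
final class OutputRowCountSketch {
    // Bounds stated in the issue: at least 1 row, at most 64K rows.
    static final int MIN_ROWS = 1;
    static final int MAX_ROWS = 64 * 1024; // assumption: 64K = 65536

    /**
     * Returns how many rows of the given average size fit in the
     * memory budget for one output batch, clamped to [MIN_ROWS, MAX_ROWS].
     */
    static int rowsPerBatch(long memoryLimitBytes, long avgOutputRowBytes) {
        if (avgOutputRowBytes <= 0) {
            // No usable size estimate: fall back to the upper bound.
            return MAX_ROWS;
        }
        long rows = memoryLimitBytes / avgOutputRowBytes;
        return (int) Math.max(MIN_ROWS, Math.min(MAX_ROWS, rows));
    }
}
```

With a 16 MB budget and 1 KB rows, this yields 16384 rows per batch. A single row larger than the budget still yields a batch of one row, which matches the "min 1" bound in the description.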



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
