drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5457) Support Spill to Disk for the Hash Aggregate Operator
Date Sat, 27 May 2017 05:39:05 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027216#comment-16027216

ASF GitHub Bot commented on DRILL-5457:

Github user paul-rogers commented on a diff in the pull request:

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
    @@ -18,82 +18,161 @@
     package org.apache.drill.exec.physical.impl.aggregate;
     import java.io.IOException;
    +import java.io.InputStream;
    +import java.io.OutputStream;
     import java.util.ArrayList;
     import java.util.Iterator;
     import java.util.List;
    +import java.util.concurrent.TimeUnit;
     import javax.inject.Named;
    +import com.google.common.base.Stopwatch;
    +import org.apache.drill.common.exceptions.DrillRuntimeException;
     import org.apache.drill.common.expression.ErrorCollector;
     import org.apache.drill.common.expression.ErrorCollectorImpl;
     import org.apache.drill.common.expression.ExpressionPosition;
     import org.apache.drill.common.expression.FieldReference;
     import org.apache.drill.common.expression.LogicalExpression;
    +import org.apache.drill.common.expression.SchemaPath;
    +import org.apache.drill.exec.ExecConstants;
    +import org.apache.drill.exec.cache.VectorAccessibleSerializable;
     import org.apache.drill.exec.compile.sig.RuntimeOverridden;
     import org.apache.drill.exec.exception.ClassTransformationException;
    +import org.apache.drill.exec.exception.OutOfMemoryException;
     import org.apache.drill.exec.exception.SchemaChangeException;
     import org.apache.drill.exec.expr.TypeHelper;
     import org.apache.drill.exec.memory.BufferAllocator;
     import org.apache.drill.exec.ops.FragmentContext;
     import org.apache.drill.exec.ops.MetricDef;
    +import org.apache.drill.exec.ops.OperatorContext;
     import org.apache.drill.exec.ops.OperatorStats;
     import org.apache.drill.exec.physical.config.HashAggregate;
     import org.apache.drill.exec.physical.impl.common.ChainedHashTable;
     import org.apache.drill.exec.physical.impl.common.HashTable;
     import org.apache.drill.exec.physical.impl.common.HashTableConfig;
     import org.apache.drill.exec.physical.impl.common.HashTableStats;
     import org.apache.drill.exec.physical.impl.common.IndexPointer;
    +import org.apache.drill.exec.physical.impl.spill.RecordBatchSizer;
    +import org.apache.drill.exec.physical.impl.spill.SpillSet;
    +import org.apache.drill.exec.planner.physical.AggPrelBase;
    +import org.apache.drill.exec.proto.UserBitShared;
    +import org.apache.drill.exec.record.CloseableRecordBatch;
     import org.apache.drill.exec.record.MaterializedField;
     import org.apache.drill.exec.record.RecordBatch;
    -import org.apache.drill.exec.record.RecordBatch.IterOutcome;
    -import org.apache.drill.exec.record.TypedFieldId;
    +import org.apache.drill.exec.record.BatchSchema;
     import org.apache.drill.exec.record.VectorContainer;
    +import org.apache.drill.exec.record.TypedFieldId;
    +import org.apache.drill.exec.record.RecordBatch.IterOutcome;
     import org.apache.drill.exec.record.VectorWrapper;
    +import org.apache.drill.exec.record.WritableBatch;
    +import org.apache.drill.exec.record.selection.SelectionVector2;
    +import org.apache.drill.exec.record.selection.SelectionVector4;
     import org.apache.drill.exec.vector.AllocationHelper;
     import org.apache.drill.exec.vector.FixedWidthVector;
     import org.apache.drill.exec.vector.ObjectVector;
     import org.apache.drill.exec.vector.ValueVector;
     import org.apache.drill.exec.vector.VariableWidthVector;
    +import org.apache.hadoop.fs.Path;
    +import static org.apache.drill.exec.record.RecordBatch.MAX_BATCH_SIZE;
    --- End diff --
    It is impressive how you were able to slide spilling into the existing code structure.
Performance and modularity are never required, of course, but it may be worth at least considering
    Putting so much code in a template has a large drawback. Our current byte-code based code
generation performs at its worst when templates are large. This template is the base for the
generated code. In traditional Java, the size of a subclass is independent of the size of
the superclass. So, if we used "plain old" Java, the size of this template would have no performance
    But, with byte-code manipulation, each generated class contains a complete copy of the
byte code for the template class. With a huge template, we make a huge copy every time. We
pay a cost in terms of the time it takes to make the copy, then analyze the resulting byte
codes. Also, we fill up the code cache with many copies of the same code.
    Three solutions.
    1. Ignore the problem. (Which is probably the right choice until performance becomes a
    2. Refactor the code so that the bulk of logic is in a non-template class, with only a
thin layer of code in the template.
    3. Use "plain old" Java compilation for this class to avoid the many large copies described
    The other problem is that this class has so many state variables that full testing and
understanding will be hard. There is value in smaller chunks to reduce the cost of testing
and maintenance.

> Support Spill to Disk for the Hash Aggregate Operator
> -----------------------------------------------------
>                 Key: DRILL-5457
>                 URL: https://issues.apache.org/jira/browse/DRILL-5457
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.10.0
>            Reporter: Boaz Ben-Zvi
>            Assignee: Boaz Ben-Zvi
>             Fix For: 1.11.0
> Support gradual spilling memory to disk as the available memory gets too small to allow
in memory work for the Hash Aggregate Operator.

This message was sent by Atlassian JIRA

View raw message