drill-dev mailing list archives

From Ben-Zvi <...@git.apache.org>
Subject [GitHub] drill pull request #767: DRILL-5226: Managed external sort fixes
Date Fri, 03 Mar 2017 22:10:25 GMT
Github user Ben-Zvi commented on a diff in the pull request:

    https://github.com/apache/drill/pull/767#discussion_r104252291
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/ExternalSortBatch.java ---
    @@ -1333,8 +1339,43 @@ private void spillFromMemory() {
         mergeAndSpill(bufferedBatches, spillCount);
       }
     
    +  private void mergeRuns(int targetCount) {
    +
    +    // Determine the number of runs to merge. The count should be the
    +    // target count. However, to prevent possible memory overrun, we
    +    // double-check with actual spill batch size and only spill as much
    +    // as fits in the merge memory pool.
    +
    +    int mergeCount = 0;
    +    long mergeSize = 0;
    +    for (SpilledRun batch : spilledRuns) {
    +      long batchSize = batch.getBatchSize();
    +      if (mergeSize + batchSize > mergeMemoryPool) {
    +        break;
    +      }
    +      mergeSize += batchSize;
    +      mergeCount++;
    +      if (mergeCount == targetCount) {
    +        break;
    +      }
    +    }
    +
    +    // Must always spill at least 2, even if this creates an over-size
    +    // spill file. But, if this is a final consolidation, we may have only
    +    // a single batch.
    +
    +    mergeCount = Math.max(mergeCount, 2);
    +    mergeCount = Math.min(mergeCount, spilledRuns.size());
    +
    +    // Do the actual spill.
    +
    +    mergeAndSpill(spilledRuns, mergeCount);
    --- End diff --
    
    Just a comment: we always merge the _FIRST_ mergeCount runs. So if one of the first
runs has some crazy "oversized" batch, we'd repeatedly merge only a small number of runs, as
that bad batch may persist from one merge to the next.
    Alternatively, select the runs with the smaller "max batch" sizes, and hence get more
runs to merge.
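
    A rough sketch of that alternative, to make the idea concrete. Everything here is
hypothetical: `Run` stands in for `SpilledRun`, `selectRunsToMerge` is an invented name,
and it assumes the batch size each run reports is the figure that counts against the merge
memory pool. It keeps the same limits as the diff above (target count, memory pool, at
least 2 runs), but considers runs in order of ascending batch size rather than list order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RunSelectionSketch {

  // Hypothetical stand-in for SpilledRun; only the batch size matters here.
  static final class Run {
    final String name;
    final long batchSize;

    Run(String name, long batchSize) {
      this.name = name;
      this.batchSize = batchSize;
    }
  }

  // Picks the runs to merge, smallest batch size first, instead of
  // taking the first N runs in list order.
  static List<Run> selectRunsToMerge(List<Run> spilledRuns, int targetCount,
                                     long mergeMemoryPool) {
    // Sort a copy so the caller's list order is left untouched.
    List<Run> bySize = new ArrayList<>(spilledRuns);
    bySize.sort(Comparator.comparingLong((Run r) -> r.batchSize));

    int mergeCount = 0;
    long mergeSize = 0;
    for (Run run : bySize) {
      if (mergeSize + run.batchSize > mergeMemoryPool) {
        break;
      }
      mergeSize += run.batchSize;
      mergeCount++;
      if (mergeCount == targetCount) {
        break;
      }
    }

    // Same floor and ceiling as the original: merge at least 2 runs,
    // but never more than exist.
    mergeCount = Math.max(mergeCount, 2);
    mergeCount = Math.min(mergeCount, bySize.size());

    return new ArrayList<>(bySize.subList(0, mergeCount));
  }
}
```

    With batch sizes 100, 10, 10, 10 (in that order), a pool of 50 and a target of 4, the
first-N loop stops at the first run and falls back to the minimum of 2 runs, one of them
oversized; the sketch picks the three small runs instead. The selected runs are no longer a
prefix of the list, so the `mergeAndSpill(spilledRuns, mergeCount)` call would presumably
need to accept an explicit list of runs.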


