drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5635) External sort need not buffer two incoming batches; just two spill files' worth
Date Fri, 30 Jun 2017 05:01:00 GMT
Paul Rogers created DRILL-5635:

             Summary: External sort need not buffer two incoming batches; just two spill files'
                 Key: DRILL-5635
                 URL: https://issues.apache.org/jira/browse/DRILL-5635
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.10.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers
            Priority: Minor

The external sort has long had the rule that it must have space in memory to buffer two incoming
batches. This rule stems from the merge/sort algorithm that requires a minimum of two *records*
per file to make progress.

But, Drill is batch-based, not record-based. Drill must spill at least two records (preferably
much more) to make progress.

The original code confused these issues and required room for two incoming batches. This restriction
can be softened. In the worst case, Drill only needs two records: either one batch of (at
least) two records, or two batches of one record each.

The difference in rule will better handle the case of very large incoming batches relative
to memory available for sort. Suppose an incoming batch is 500 MB in size and the sort has
750 MB available. The code today fails the sort because two batches cannot fit in memory.
But, the sort, by default, creates spill files of 250 MB. Since a single input batch can fill
two spill files, there is no reason to wait for a second batch before spilling.

Special care is needed to handle this case. Divide the "jumbo" batch into spill files of near-equal
size, but each must contain at least two records.

The only limit occurs if individual records (assuming one record per batch) exceed 25% of
memory. (Because, today, the sort copies records prior to spilling. Another JIRA will address
that fault.)

This message was sent by Atlassian JIRA

View raw message