Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 45B48200BB9 for ; Tue, 8 Nov 2016 00:01:02 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 4491D160AE0; Mon, 7 Nov 2016 23:01:02 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 6B915160AF9 for ; Tue, 8 Nov 2016 00:01:01 +0100 (CET) Received: (qmail 11055 invoked by uid 500); 7 Nov 2016 23:01:00 -0000 Mailing-List: contact issues-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@drill.apache.org Delivered-To: mailing list issues@drill.apache.org Received: (qmail 11008 invoked by uid 99); 7 Nov 2016 23:01:00 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 07 Nov 2016 23:01:00 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 529E72C2A6E for ; Mon, 7 Nov 2016 23:01:00 +0000 (UTC) Date: Mon, 7 Nov 2016 23:01:00 +0000 (UTC) From: "Paul Rogers (JIRA)" To: issues@drill.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (DRILL-5008) Refactor, document and simplify ExternalSortBatch MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Mon, 07 Nov 2016 23:01:02 -0000 [ https://issues.apache.org/jira/browse/DRILL-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-5008: ------------------------------- Description: ExternalSortBatch provides a spillable sort operator for Drill. The code works fine, but can be a hard to follow and understand. Make the following changes to improve ease-of-use for developers: 1. Refactor the large methods into bite-sized chunks to aid understanding. 2. Provide additional explanation of the theory and operation of this operator. The memory limit cases for the spill-needed test seem inconsistent: For the test for in-memory sort: {code} long currentlyAvailable = popConfig.getMaxAllocation() - oAllocator.getAllocatedMemory(); {code} For reaching the memory limit: {code} oAllocator.getAllocatedMemory() > .95 * oAllocator.getLimit() {code} That is, one uses `oAllocator.getLimit` ("the current maximum limit this allocator imposes"), the other uses `popConfig.getMaxAllocation` ("The maximum memory this operator can allocate".) 3. The spill operation is based on an estimated record width, but the estimate is based on only the first record in the first batch. The record width is then used to ensure the spilled records fit within a memory limit. However, a single unusual record in the first position will throw off the calculation, possibly leading to excess memory use. Better would be to sample a set of records to determine an average. Or, to find a way to work within a memory limit without a record width calculation (since record width is just an intermediate toward computing memory use.) 4. Much code in ESB deals with building vector batches and schemas. However, ESB cannot handle schema changes. The only valid schema difference is the same field path in a different position in the vector array. Given this restriction, the code can be simplified (and sped up) by exploiting the fact that all batches are required to have the same conceptual schema (same set of fields, but perhaps in different vector order) and most probably, the same physical schema (same fields and same vector order.) Note that, because of the way that the `getValueVectorId()` method works, each lookup of a value vector is an O(n) operation, so that each remapping of vectors is O(n^2). was: ExternalSortBatch provides a spillable sort operator for Drill. The code works fine, but can be a hard to follow and understand. Make the following changes to improve ease-of-use for developers: 1. Refactor the large methods into bite-sized chunks to aid understanding. 2. Provide additional explanation of the theory and operation of this operator. The memory limit cases for the spill-needed test seem inconsistent: For the test for in-memory sort: {code} long currentlyAvailable = popConfig.getMaxAllocation() - oAllocator.getAllocatedMemory(); {code} For reaching the memory limit: {code} oAllocator.getAllocatedMemory() > .95 * oAllocator.getLimit() {code} That is, one uses `oAllocator.getLimit` ("the current maximum limit this allocator imposes"), the other uses `popConfig.getMaxAllocation` ("The maximum memory this operator can allocate".) 3. The spill operation is based on an estimated record width, but the estimate is based on only the first record in the first batch. The record width is then used to ensure the spilled records fit within a memory limit. However, a single unusual record in the first position will throw off the calculation, possibly leading to excess memory use. Better would be to sample a set of records to determine an average. Or, to find a way to work within a memory limit without a record width calculation (since record width is just an intermediate toward computing memory use.) 4. Much code in ESB deals with building vector batches and schemas. However, ESB cannot handle schema changes. The only valid schema difference is the same field path in a different position in the vector array. Given this restriction, the code can be simplified (and sped up) by exploiting the fact that all batches are required to have the same conceptual schema (same set of fields, but perhaps in different vector order) and most probably, the same physical schema (same fields and same vector order.) Note that, because of the way that the `getValueVectorId()` method works, each lookup of a value vector is an O(n) operation, so that each remapping of vectors is O(1/2 n^2). > Refactor, document and simplify ExternalSortBatch > ------------------------------------------------- > > Key: DRILL-5008 > URL: https://issues.apache.org/jira/browse/DRILL-5008 > Project: Apache Drill > Issue Type: Improvement > Affects Versions: 1.8.0 > Reporter: Paul Rogers > Priority: Minor > > ExternalSortBatch provides a spillable sort operator for Drill. The code works fine, but can be a hard to follow and understand. Make the following changes to improve ease-of-use for developers: > 1. Refactor the large methods into bite-sized chunks to aid understanding. > 2. Provide additional explanation of the theory and operation of this operator. > The memory limit cases for the spill-needed test seem inconsistent: > For the test for in-memory sort: > {code} > long currentlyAvailable = popConfig.getMaxAllocation() - oAllocator.getAllocatedMemory(); > {code} > For reaching the memory limit: > {code} > oAllocator.getAllocatedMemory() > .95 * oAllocator.getLimit() > {code} > That is, one uses `oAllocator.getLimit` ("the current maximum limit this allocator imposes"), the other uses `popConfig.getMaxAllocation` ("The maximum memory this operator can allocate".) > 3. The spill operation is based on an estimated record width, but the estimate is based on only the first record in the first batch. The record width is then used to ensure the spilled records fit within a memory limit. However, a single unusual record in the first position will throw off the calculation, possibly leading to excess memory use. > Better would be to sample a set of records to determine an average. Or, to find a way to work within a memory limit without a record width calculation (since record width is just an intermediate toward computing memory use.) > 4. Much code in ESB deals with building vector batches and schemas. However, ESB cannot handle schema changes. The only valid schema difference is the same field path in a different position in the vector array. Given this restriction, the code can be simplified (and sped up) by exploiting the fact that all batches are required to have the same conceptual schema (same set of fields, but perhaps in different vector order) and most probably, the same physical schema (same fields and same vector order.) Note that, because of the way that the `getValueVectorId()` method works, each lookup of a value vector is an O(n) operation, so that each remapping of vectors is O(n^2). -- This message was sent by Atlassian JIRA (v6.3.4#6332)