Date: Tue, 8 Nov 2016 18:08:58 +0000 (UTC)
From: "Paul Rogers (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Created] (DRILL-5013) Heap allocation, data copies in UDLE write path for ExternalSortBatch

Paul Rogers created DRILL-5013:
----------------------------------

             Summary: Heap allocation, data copies in UDLE write path for ExternalSortBatch
                 Key: DRILL-5013
                 URL: https://issues.apache.org/jira/browse/DRILL-5013
             Project: Apache Drill
          Issue Type: Improvement
   Affects Versions: 1.8.0
            Reporter: Paul Rogers
            Priority: Minor

The ExternalSortBatch (ESB) uses spill-to-disk to sort a large collection of records within a limited memory footprint. As part of writing data to disk, ESB writes the byte buffer backing each vector to an output stream. Since the vector is stored in direct memory (not visible to an output stream), the code path first makes a temporary on-heap copy. In particular, the code in `io.netty.buffer.PooledUnsafeDirectByteBuf` does the following:

{code}
@Override
public ByteBuf getBytes(int index, OutputStream out, int length) throws IOException {
    checkIndex(index, length);
    if (length != 0) {
        byte[] tmp = new byte[length];
        PlatformDependent.copyMemory(addr(index), tmp, 0, length);
        out.write(tmp);
    }
    return this;
}
{code}

The result is that we 1) allocate a new on-heap array, as large as the data being written, on every call, and 2) copy the data twice: once from direct memory to the tmp buffer, and again from the tmp buffer into the output stream's own buffer.

Two optimizations are possible:

1. Copy the data byte-by-byte from the direct memory buffer to the output stream, or
2. Reuse the same tmp buffer across vector writes.

Since the code is in Netty, doing either of the above means writing our own "getBytes" method (a misnomer: it really writes bytes).
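
The second optimization (reusing one tmp buffer across vector writes) might look roughly like the sketch below. It is illustrative only: `ReusableBufferWriter` and its `write` method are hypothetical names, not Drill or Netty APIs, and a plain `java.nio.ByteBuffer` allocated with `allocateDirect` stands in for the direct-memory buffer that backs a vector. The data is copied in chunks through a single preallocated array, so the per-call heap allocation no longer grows with the size of the write.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

/**
 * Hypothetical helper (not part of Drill or Netty): streams bytes from a
 * direct buffer to an OutputStream through one reusable on-heap array.
 */
final class ReusableBufferWriter {
    // Allocated once and reused for every write, instead of new byte[length] per call.
    private final byte[] tmp;

    ReusableBufferWriter(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.tmp = new byte[chunkSize];
    }

    /** Writes length bytes of src, starting at absolute offset index, to out. */
    void write(ByteBuffer src, int index, int length, OutputStream out) throws IOException {
        if (index < 0 || length < 0 || index > src.capacity() - length) {
            throw new IndexOutOfBoundsException("index=" + index + ", length=" + length);
        }
        // duplicate() shares the memory but has its own position and limit, so the
        // caller's buffer state is untouched. It creates a small view object,
        // not a copy of the data.
        ByteBuffer view = src.duplicate();
        view.clear();                 // position = 0, limit = capacity
        view.limit(index + length);
        view.position(index);
        while (view.hasRemaining()) {
            int n = Math.min(view.remaining(), tmp.length);
            view.get(tmp, 0, n);      // direct memory -> reusable heap array
            out.write(tmp, 0, n);     // heap array -> output stream
        }
    }
}
```

One writer instance would be created per spill file (or per operator) and passed each vector's buffer in turn. This removes the large per-write allocation but still copies the data twice; the chunk size trades heap footprint against the number of `out.write` calls. The instance is not thread-safe, since concurrent callers would share `tmp`.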