Date: Thu, 6 Apr 2017 20:22:41 +0000 (UTC)
From: "Paul Rogers (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Assigned] (DRILL-5366) Use generic copier for wide rows in external sort

     [ https://issues.apache.org/jira/browse/DRILL-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers reassigned DRILL-5366:
----------------------------------

         Assignee:     (was: Paul Rogers)
    Fix Version/s:     (was: 1.11.0)

> Use generic copier for wide rows in external sort
> -------------------------------------------------
>
>                 Key: DRILL-5366
>                 URL: https://issues.apache.org/jira/browse/DRILL-5366
>             Project: Apache Drill
>          Issue Type: Sub-task
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>
> The external sort makes use of a "priority copier" to copy rows at two points:
> * When merging data during spilling
> * When merging data for an in-memory sort
> As with all such Drill operators, the code works by generating two local variables per column, then generating two blocks of code per column (one for setup, one for the actual copy).
> This works fine for rows with few columns. But in rows with many columns (such as for queries against JSON documents), the amount of code produced becomes very large. This introduces extra overhead to generate, compile and store the extra code.
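To make the code-size point above concrete, here is a minimal Java sketch. It is not Drill's actual code: `IntColumn` and `GenericCopierSketch` are hypothetical names standing in for Drill's value vectors and copier, every column is assumed to hold ints, and the generic copier is assumed to work by looping over the columns at run time. A generated copier would emit its own variables and code blocks for each column, so its size grows with the column count. The loop below stays the same size whether a row has 10 columns or 1000.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a column vector; not Drill's ValueVector API. */
final class IntColumn {
    final int[] values;

    IntColumn(int rowCount) {
        values = new int[rowCount];
    }
}

/** Illustrative generic copier: one loop handles any number of columns. */
final class GenericCopierSketch {

    /** Builds a batch of columnCount columns, each with rowCount rows. */
    static List<IntColumn> newBatch(int columnCount, int rowCount) {
        List<IntColumn> batch = new ArrayList<>(columnCount);
        for (int col = 0; col < columnCount; col++) {
            batch.add(new IntColumn(rowCount));
        }
        return batch;
    }

    /** Copies one row, column by column, from the input batch to the output batch. */
    static void copyRow(List<IntColumn> in, int inIndex,
                        List<IntColumn> out, int outIndex) {
        if (in.size() != out.size()) {
            throw new IllegalArgumentException("column counts differ");
        }
        for (int col = 0; col < in.size(); col++) {
            out.get(col).values[outIndex] = in.get(col).values[inIndex];
        }
    }
}
```

The trade-off in this sketch is that each value is reached through an extra loop and list lookup instead of straight-line generated code. That is one plausible reason a generated copier might still be preferred for narrow rows.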
> DRILL-5125 found the same issue in the Selection Vector remover. By applying the fix from that ticket to the external sort, we reap a 24% savings in copier run time.
> Consider a unit test that runs only the copier. Create a single record batch with 1000 columns and 64K rows. Use the copier to produce a set of smaller output batches. Such a test factors out all the overhead of running a query.
> * Run time for a generated copier: 17 secs.
> * Run time with the generic copier: 13 secs.
> * Savings: 4 seconds or 24%.
> (13 seconds is still a very long time to process 64K rows. There may be optimizations to be had in the priority queue implementation as well, but that is a separate issue.)
> To be conservative, provide a config option to enable the feature, perhaps by setting a threshold of the number of columns that must be present to use the generic version. That way, if folks feel that the generated version is faster for narrow rows, the generated version can be used. And each user can decide the point at which the costs of bulky code outweigh the performance costs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
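The column-count threshold proposed in the description could be sketched as follows. This is only an illustration under stated assumptions: `CopierSelector`, `CopierKind` and the `genericCopierThreshold` parameter are hypothetical names, not existing Drill classes or config keys, and the ticket does not say what the threshold value should be.

```java
/** Hypothetical selector; names and semantics are illustrative, not Drill's API. */
final class CopierSelector {

    enum CopierKind { GENERATED, GENERIC }

    // Minimum number of columns at which the generic copier is used.
    private final int genericCopierThreshold;

    CopierSelector(int genericCopierThreshold) {
        if (genericCopierThreshold < 0) {
            throw new IllegalArgumentException("threshold must be non-negative");
        }
        this.genericCopierThreshold = genericCopierThreshold;
    }

    /** Narrow rows keep the generated copier; wide rows switch to the generic one. */
    CopierKind choose(int columnCount) {
        return columnCount >= genericCopierThreshold
                ? CopierKind.GENERIC
                : CopierKind.GENERATED;
    }
}
```

With a threshold of 100, for example, a 10-column batch would keep the generated copier and the 1000-column batch from the unit test would use the generic one. In this sketch, a threshold of 0 turns the generic copier on for every batch, and a very large threshold effectively turns it off.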