Date: Mon, 13 Mar 2017 17:15:41 +0000 (UTC)
From: "Parth Chandra (JIRA)"
To: dev@drill.apache.org
Subject: [jira] [Created] (DRILL-5351) Excessive bounds checking in the Parquet reader

Parth Chandra created DRILL-5351:
------------------------------------

             Summary: Excessive bounds checking in the Parquet reader
                 Key: DRILL-5351
                 URL: https://issues.apache.org/jira/browse/DRILL-5351
             Project: Apache Drill
          Issue Type: Improvement
            Reporter: Parth Chandra

In profiling the Parquet reader, variable-length decoding appears to be a major bottleneck, making the reader CPU bound rather than disk bound.
A YourKit profile shows the following methods to be severe bottlenecks:

    VarLenBinaryReader.determineSizeSerial(long)
    NullableVarBinaryVector$Mutator.setSafe(int, int, int, int, DrillBuf)
    DrillBuf.chk(int, int)
    NullableVarBinaryVector$Mutator.fillEmpties()

The problem is that each of these methods does some form of bounds checking, and eventually, of course, the actual write to the ByteBuf is also bounds checked.

DrillBuf.chk can be disabled by a configuration setting. Disabling it does improve the performance of TPCH queries. In addition, all regression, unit, and TPCH-SF100 tests pass with the check disabled.

I would recommend we allow users to turn this check off if there are performance-critical queries.

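To make the layering concrete, here is a minimal sketch of how one logical write can be bounds checked several times. It is not Drill's code: the class, the writeValue and chk methods, and the boundsCheckingEnabled flag are hypothetical stand-ins for the vector-level write, DrillBuf.chk, and the configuration setting, and java.nio.ByteBuffer stands in for the underlying ByteBuf.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; none of this is Drill's actual DrillBuf or vector code. */
public class LayeredBoundsCheckSketch {

    /** Hypothetical switch standing in for the configuration setting that disables the buffer check. */
    static boolean boundsCheckingEnabled = true;

    /** Counts the explicit checks made by this sketch (ByteBuffer's internal check is not counted). */
    static long explicitChecks = 0;

    /** Buffer-level check, playing the role DrillBuf.chk plays in the profile. */
    static void chk(ByteBuffer buf, int index, int length) {
        if (!boundsCheckingEnabled) {
            return;
        }
        explicitChecks++;
        if (index < 0 || length < 0 || index > buf.capacity() - length) {
            throw new IndexOutOfBoundsException(
                "index=" + index + ", length=" + length + ", capacity=" + buf.capacity());
        }
    }

    /** Vector-level write (simplified): checks capacity itself, then goes through the checked buffer. */
    static void writeValue(ByteBuffer buf, int offset, byte[] value) {
        explicitChecks++;
        if (offset < 0 || offset > buf.capacity() - value.length) {
            throw new IndexOutOfBoundsException("value does not fit at offset " + offset);
        }
        chk(buf, offset, value.length);
        for (int i = 0; i < value.length; i++) {
            buf.put(offset + i, value[i]); // ByteBuffer.put(int, byte) validates the index once more
        }
    }

    /** Writes count copies of value back to back and returns the number of explicit checks made. */
    static long writeValues(int count, byte[] value, boolean checksEnabled) {
        boundsCheckingEnabled = checksEnabled;
        explicitChecks = 0;
        ByteBuffer buf = ByteBuffer.allocate(count * value.length);
        for (int i = 0; i < count; i++) {
            writeValue(buf, i * value.length, value);
        }
        return explicitChecks;
    }

    public static void main(String[] args) {
        byte[] value = "drill".getBytes(StandardCharsets.UTF_8);
        System.out.println("explicit checks, buffer check on:  " + writeValues(1000, value, true));
        System.out.println("explicit checks, buffer check off: " + writeValues(1000, value, false));
    }
}
```

For 1000 values the sketch counts 2000 explicit checks with the buffer-level check on and 1000 with it off, and the underlying buffer still validates every byte it writes in both cases. The redundancy has the same shape as the one described above, though the real cost in Drill depends on the actual code paths.
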
Removing the bounds checking at every level is going to be a fair amount of work. In the meantime, it appears that a few simple changes to variable-length vectors improve query performance by about 10% across the board.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)