Date: Thu, 4 May 2017 17:12:04 +0000 (UTC)
From: "Paul Rogers (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Commented] (DRILL-5472) Parquet reader generating low-density batches causing Sort operator to spill un-necessarily

    [ https://issues.apache.org/jira/browse/DRILL-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997100#comment-15997100 ]

Paul Rogers commented on DRILL-5472:
------------------------------------

This is a known issue with Parquet, but one that is not currently a high priority. The expectation is that it will be resolved as a side effect of the fix for DRILL-5211. For that bug, we must limit vector sizes to 16 MB. At present, the Parquet reader tries, but fails, to limit vector sizes. That failure causes random vector sizes and low density. Fixing the Parquet vector limit to avoid fragmentation will also, perhaps, reduce the low-density problem without this issue itself having to be a high priority.
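To make the low-density effect concrete, here is a rough back-of-the-envelope sketch using the figures from the quoted issue report (about 1.2 GB of uncompressed data, about 6 GB of sort memory). It assumes a deliberately simplified model in which "density" is the fraction of each allocated vector that holds real data; the function names are hypothetical, and this is not how Drill itself accounts for memory.

```python
GB = 1024 ** 3


def allocated_bytes(data_bytes: float, density: float) -> float:
    """Memory held by vectors when only `density` of each allocation is data.

    Simplified assumption: allocated = data / density.
    """
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    return data_bytes / density


def would_spill(data_bytes: float, density: float, budget_bytes: float) -> bool:
    """True if the buffered batches would exceed the sort's memory budget."""
    return allocated_bytes(data_bytes, density) > budget_bytes


def break_even_density(data_bytes: float, budget_bytes: float) -> float:
    """Density below which the data no longer fits in the budget."""
    return data_bytes / budget_bytes


data = 1.2 * GB    # uncompressed size, from the issue report
budget = 6 * GB    # sort memory for the fragment, from the issue report

print(would_spill(data, 0.5, budget))   # half-full vectors
print(would_spill(data, 0.1, budget))   # one-tenth-full vectors
print(break_even_density(data, budget))
```

Under this model, the sort would only need to spill once the average batch density drops below roughly one fifth, which is consistent with the report that a 1.2 GB data set spills despite a 6 GB budget.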

> Parquet reader generating low-density batches causing Sort operator to spill un-necessarily
> -------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5472
>                 URL: https://issues.apache.org/jira/browse/DRILL-5472
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators, Storage - Parquet
>            Reporter: Rahul Challapalli
>            Assignee: Paul Rogers
>         Attachments: drill5472.log, drill5472.parquet, drill5472.sys.drill
>
>
> git.commit.id.abbrev=1e0a14c
> The Parquet file used in the query below is ~20 MB. The uncompressed size is ~1.2 GB. The query has a sort which is given ~6 GB of memory for a single fragment, and yet it spills.
> {code}
> select * from (select * from dfs.`/drill/testdata/resource-manager/all_types_large` s order by s.missing12.x) d where d.missing3 is false;
> {code}
> The profile indicates that the above query spilled twice. The profile and the logs are attached.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)