Date: Tue, 28 Mar 2017 06:06:41 +0000 (UTC)
From: "Kunal Khatua (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Closed] (DRILL-5207) Improve Parquet scan pipelining

     [ https://issues.apache.org/jira/browse/DRILL-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kunal Khatua closed DRILL-5207.
-------------------------------

Verified that by introducing a large queue (1024), the scan threads were able to read faster from the distributed file system.

|*queueSize* | *Disk Read Rate (MB/s)*|
|1 |194.8428|
|2 |207.9598|
|1024| 297.5866|

Currently, a size of 2 is sufficient since the increased disk rate will still be bottlenecked by the scan operator that needs to translate the parquet pages into value vectors.

> Improve Parquet scan pipelining
> -------------------------------
>
>                 Key: DRILL-5207
>                 URL: https://issues.apache.org/jira/browse/DRILL-5207
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>              Labels: doc-impacting
>             Fix For: 1.10.0
>
>
> The parquet reader's async page reader is not quite efficiently pipelined.
> The default size of the disk read buffer is 4MB while the page reader reads ~1MB at a time.
> The Parquet decode is also processing 1MB at a time. This means the disk is idle while the data is being processed. Reducing the buffer to 1MB will reduce the time the processing thread waits for the disk read thread.
> Additionally, since the data to process a page may be more or less than 1MB, a queue of pages will help so that the disk scan does not block (until the queue is full), waiting for the processing thread.
> Additionally, the BufferedDirectBufInputStream class reads from disk as soon as it is initialized. Since this is called at setup time, this increases the setup time for the query and query execution does not begin until this is completed.
> There are a few other inefficiencies - options are read every time a page reader is created. Reading options can be expensive.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
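
The page queue described in the issue is a bounded producer/consumer hand-off: the disk read thread keeps filling the queue and blocks only when it is full, while the processing thread drains it. A minimal sketch of that pattern follows, written in Python purely for illustration (Drill itself is Java). All names here (read_pages, decode_pages, run_pipeline, queue_size) are hypothetical and are not Drill's classes or options; taking the page length stands in for the real decode into value vectors.

```python
import queue
import threading

# Unique marker telling the decoder that no more pages will arrive.
_DONE = object()


def read_pages(pages, page_queue):
    """Reader thread: enqueue pages; put() blocks only while the queue is full."""
    for page in pages:
        page_queue.put(page)
    page_queue.put(_DONE)


def decode_pages(page_queue, decoded):
    """Processing thread: dequeue pages and 'decode' them until the marker arrives."""
    while True:
        page = page_queue.get()
        if page is _DONE:
            break
        # Stand-in for decoding a Parquet page (assumption: real work goes here).
        decoded.append(len(page))


def run_pipeline(pages, queue_size=2):
    """Run one reader and one decoder joined by a bounded queue of queue_size pages."""
    page_queue = queue.Queue(maxsize=queue_size)
    decoded = []
    reader = threading.Thread(target=read_pages, args=(pages, page_queue))
    decoder = threading.Thread(target=decode_pages, args=(page_queue, decoded))
    reader.start()
    decoder.start()
    reader.join()
    decoder.join()
    return decoded


if __name__ == "__main__":
    # Three fake pages of different sizes, queue depth 2 as in the closing comment.
    print(run_pipeline([b"a" * 3, b"b" * 5, b"c" * 2], queue_size=2))
```

With a single reader and a single decoder the queue preserves page order, so the decoded results come back in the order the pages were read. A larger queue_size lets the reader run further ahead, which only helps while the decoder is not the bottleneck, consistent with the observation above that a size of 2 is currently sufficient.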