Date: Sun, 7 Feb 2016 17:52:40 +0000 (UTC)
From: "Aman Sinha (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Created] (DRILL-4365) Performance with lots of small parquet files

Aman Sinha created DRILL-4365:
---------------------------------

             Summary: Performance with lots of small parquet files
                 Key: DRILL-4365
                 URL: https://issues.apache.org/jira/browse/DRILL-4365
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.5.0
            Reporter: Aman Sinha


I am seeing a performance degradation on 1.5.0 compared to 1.4.0 with a query over 968 small parquet files where the total # rows is only 1000, so just about 1 row per file. The profile shows parquet scan is slower.
With bigger tables, I haven't seen the same issue yet (although this needs confirmation from the full performance run). Note: this is with the default slice_target of 100K, so only 1 scan fragment was used. I will attach the dataset to this JIRA if anyone wants to repro.

On 1.4.0 (multiple runs):
{noformat}
0: jdbc:drill:zk=local> select min(ss_item_sk) from dfs.tmp.ss1test ;
+---------+
| EXPR$0  |
+---------+
| 39      |
+---------+
1 row selected (2.544 seconds)

0: jdbc:drill:zk=local> select min(ss_item_sk) from dfs.tmp.ss1test ;
+---------+
| EXPR$0  |
+---------+
| 39      |
+---------+
1 row selected (2.434 seconds)
{noformat}

On 1.5.0 (multiple runs):
{noformat}
0: jdbc:drill:zk=local> select min(ss_item_sk) from dfs.tmp.ss1test ;
+---------+
| EXPR$0  |
+---------+
| 39      |
+---------+
1 row selected (3.851 seconds)

0: jdbc:drill:zk=local> select min(ss_item_sk) from dfs.tmp.ss1test ;
+---------+
| EXPR$0  |
+---------+
| 39      |
+---------+
1 row selected (3.61 seconds)
{noformat}
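
To put a number on the regression, the four timings reported above can be averaged and compared. The following is only a rough sketch: the variable names are illustrative, two runs per version is a very small sample, and spreading the whole difference evenly across the 968 files is an assumption, since the report only says the profile shows a slower parquet scan.

```python
from statistics import mean

# Wall-clock times in seconds, copied from the sqlline output in this report.
runs_1_4_0 = [2.544, 2.434]
runs_1_5_0 = [3.851, 3.61]

# File count from the description (968 small parquet files).
num_files = 968

mean_old = mean(runs_1_4_0)
mean_new = mean(runs_1_5_0)

# Relative slowdown of 1.5.0 versus 1.4.0, in percent.
slowdown_pct = (mean_new / mean_old - 1) * 100

# ASSUMPTION: all of the extra time is per-file scan overhead.
extra_ms_per_file = (mean_new - mean_old) / num_files * 1000

print(f"1.4.0 mean: {mean_old:.3f} s")
print(f"1.5.0 mean: {mean_new:.3f} s")
print(f"slowdown:   {slowdown_pct:.1f} %")
print(f"extra time per file (if evenly spread): {extra_ms_per_file:.2f} ms")
```

On these numbers 1.5.0 is roughly 50% slower, which would correspond to a bit over 1 ms of extra time per file if the overhead really is per-file.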