Date: Thu, 24 Sep 2015 22:01:22 -0700
Subject: Re: Improving Performance of SELECT * FROM hive.table LIMIT 0
From: Jinfeng Ni
To: dev@drill.apache.org

"FragmentExecutor took 1,070,926 ms to create RecordBatch tree."

1,070,926 ms is roughly 17.8 minutes. In other words, the majority of the 18 minutes of execution in the Hive case is spent on the initialization of the Hive readers.

If we want to improve "LIMIT n", we should probably make the initialization of the Hive readers lazy: only when Drill actually has to read rows from a reader do we initialize it. Otherwise, initializing all the readers before reading a single row means a long setup time for a "LIMIT n" query when n is relatively small.
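To make the idea concrete, here is a rough sketch of a lazily initialized reader wrapper. The names (RowReader, LazyRowReader, readBatch) are illustrative only and are not Drill's actual RecordReader interface; the point is just that the expensive open happens on the first read, not at fragment setup time.

    import java.util.function.Supplier;

    // Illustrative interface, standing in for whatever reader abstraction the scan uses.
    interface RowReader {
        void open();        // expensive: contacts the metastore, opens files, etc.
        int readBatch();    // returns number of rows read, 0 when exhausted
        void close();
    }

    // Wraps a reader so that its costly setup is deferred until rows are actually requested.
    final class LazyRowReader implements RowReader {
        private final Supplier<RowReader> factory; // defers construction of the real reader
        private RowReader delegate;                // created and opened on first readBatch()

        LazyRowReader(Supplier<RowReader> factory) {
            this.factory = factory;
        }

        @Override
        public void open() {
            // Intentionally a no-op: a LIMIT 0 query never asks for rows,
            // so the expensive open below is never triggered.
        }

        @Override
        public int readBatch() {
            if (delegate == null) {        // first real read: pay the setup cost now
                delegate = factory.get();
                delegate.open();
            }
            return delegate.readBatch();
        }

        @Override
        public void close() {
            if (delegate != null) {
                delegate.close();
            }
        }
    }

With something along these lines, HiveScanBatchCreator could hand the scan batch 20,000 cheap wrappers instead of 20,000 fully opened readers, so a LIMIT 0 query would, at least in principle, never pay the per-reader setup cost.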
For the second case, the 94-second query time seems too long as well. I guess most of that time is spent on Parquet reader initialization(?).

On Thu, Sep 24, 2015 at 9:32 PM, Sudheesh Katkam wrote:
> Hey y'all,
>
> ### Short Question:
>
> How do we improve performance of SELECT * FROM plugin.table LIMIT 0?
>
> ### Extended Question:
>
> While investigating DRILL-3623, I did an analysis to see where we spend time for a SELECT * FROM hive.table LIMIT 0 query.
>
> ## Setup:
> Copy the drill/sample-data/region.parquet (x 20000) into a DFS (MapR-FS in my case) directory named region. Create a Hive external table pointing to region. Run Drill with default configuration.
>
> ## Now there are two ways to query this table:
>
>> SELECT * FROM hive.region LIMIT 0;
> +--------------+---------+------------+
> | r_regionkey  | r_name  | r_comment  |
> +--------------+---------+------------+
> +--------------+---------+------------+
> No rows selected (1203.179 seconds)
> ...
>
>> SELECT * FROM dfs.test.region LIMIT 0;
> +--------------+---------+------------+
> | r_regionkey  | r_name  | r_comment  |
> +--------------+---------+------------+
> +--------------+---------+------------+
> No rows selected (94.396 seconds)
>
> Currently, we use HiveRecordReader for the first case and ParquetRecordReader in the second case. With DRILL-3209, both queries will use ParquetRecordReader. However, for formats that are non-native to Drill or other storage plugins, we still face this problem. Summarizing the query profile:
> +-------+-----------+---------------+----------------+
> | Query | Fragments | Planning time | Execution time |
> +-------+-----------+---------------+----------------+
> | hive  | 1         | ~2 min        | ~18 min        |
> | dfs   | 1         | ~1 min        | ~33 sec        |
> +-------+-----------+---------------+----------------+
>
> ## The time hogs:
>
> # Planning time in both cases needs to improve. How?
>
> # With respect to execution, in the first case the ImplCreator.getExec(…) call in the FragmentExecutor took 1,070,926 ms to create the RecordBatch tree. There are 20,000 readers being initialized in HiveScanBatchCreator. How do we avoid this? What are the implications of chained impersonation (opening readers in ctor() rather than in setup())?
>
> ### Extending further:
>
> This can be generalized to any "LIMIT n" query where n is a small number. For n > 0, we parallelize scanning, so a LIMIT 1 query runs faster than LIMIT 0. However, there is a sweet "n" after which parallelization hurts.
>
> ###
>
> Thank you,
> Sudheesh
>