Mailing-List: contact issues-help@drill.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@drill.apache.org
Date: Mon, 7 Mar 2016 21:24:40 +0000 (UTC)
From: "Suresh Ollala (JIRA)" <jira@apache.org>
To: issues@drill.apache.org
Message-ID: <JIRA.12931826.1453094173000.24436.1457385880693@Atlassian.JIRA>
In-Reply-To: <JIRA.12931826.1453094173000@Atlassian.JIRA>
References: <JIRA.12931826.1453094173000@Atlassian.JIRA>
 <JIRA.12931826.1453094173381@arcas>
Subject: [jira] [Updated] (DRILL-4279) Improve performance for skipAll query
 against Text/JSON/Parquet table
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/DRILL-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suresh Ollala updated DRILL-4279:
---------------------------------
    Reviewer: Dechang Gu

> Improve performance for skipAll query against Text/JSON/Parquet table
> ---------------------------------------------------------------------
>
>                 Key: DRILL-4279
>                 URL: https://issues.apache.org/jira/browse/DRILL-4279
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>             Fix For: 1.6.0
>
>
> When a query does not require any specific column to be returned from SCAN, for instance,
> {code}
> Q1:  select count(*) from T1;
> Q2:  select 1 + 100 from T1;
> Q3:  select  1.0 + random() from T1; 
> {code}
> Drill's planner uses a ColumnList containing the * column, plus a SKIP_ALL mode. However, the mode is not serialized / deserialized. This leads to two problems.
> 1) The EXPLAIN plan is confusing, since there is no way to distinguish a "SELECT *" query from a query planned in SKIP_ALL mode.
> For instance,
> {code}
> explain plan for select count(*) from dfs.`/Users/jni/work/data/yelp/t1`;
> 00-03          Project($f0=[0])
> 00-04            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/jni/work/data/yelp/t1, numFiles=2, columns=[`*`], files= ... 
> {code} 
> 2) If the query is executed in distributed / parallel mode, the missing serialization of the mode means that some fragments fetch all the columns while other fragments skip all the columns. That causes an execution error.
> For instance, changing slice_target to force the query to execute in multiple fragments triggers an execution error.
> {code}
> select count(*) from dfs.`/Users/jni/work/data/yelp/t1`;
> org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: Error parsing JSON - You tried to start when you are using a ValueWriter of type NullableBitWriterImpl.
> {code}
> The directory "t1" contains just two Yelp JSON files.
> Ideally, I think that when no columns are required from SCAN, the explain plan should show an empty column list. The SKIP_ALL mode together with the star (*) column is confusing and error prone.
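
The failure mode described above can be illustrated with a minimal Java sketch. It is not Drill's code: the names SkipAllSketch, ScanSpec, Mode, serializeColumnsOnly, deserializeColumnsOnly and roundTrip are all hypothetical, and the comma-separated "wire format" is an assumption chosen only to keep the example short. It shows that when only the column list crosses the wire, a `*` column planned as SKIP_ALL comes back looking like a full read, whereas an empty column list carries the intent by itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model for illustration only; these are NOT Drill's classes.
class SkipAllSketch {
    enum Mode { READ, SKIP_ALL }

    // What the planner holds in memory: a column list plus a mode.
    static final class ScanSpec {
        final List<String> columns;
        final Mode mode;

        ScanSpec(List<String> columns, Mode mode) {
            this.columns = Collections.unmodifiableList(new ArrayList<>(columns));
            this.mode = mode;
        }
    }

    // Lossy serialization (assumed format): only the columns are written,
    // the mode is dropped, mirroring the problem reported in the issue.
    static String serializeColumnsOnly(ScanSpec spec) {
        return String.join(",", spec.columns);
    }

    // The receiving side can only derive the mode from the columns it sees:
    // an empty list means "skip all", anything else (including "*") means "read".
    static ScanSpec deserializeColumnsOnly(String wire) {
        List<String> columns = wire.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(wire.split(","));
        Mode derived = columns.isEmpty() ? Mode.SKIP_ALL : Mode.READ;
        return new ScanSpec(columns, derived);
    }

    static ScanSpec roundTrip(ScanSpec spec) {
        return deserializeColumnsOnly(serializeColumnsOnly(spec));
    }

    public static void main(String[] args) {
        // Current representation: star column plus a separate SKIP_ALL mode.
        ScanSpec starPlusMode = new ScanSpec(Arrays.asList("*"), Mode.SKIP_ALL);
        // Representation suggested in the issue: an empty column list.
        ScanSpec emptyColumns = new ScanSpec(Collections.<String>emptyList(), Mode.SKIP_ALL);

        System.out.println("star + SKIP_ALL after round trip: " + roundTrip(starPlusMode).mode);
        System.out.println("empty columns after round trip:   " + roundTrip(emptyColumns).mode);
    }
}
```

In this sketch the planner-side spec and the deserialized spec disagree for the `*` case, which is the kind of mismatch between fragments described in problem 2, while the empty column list survives the round trip unchanged. Serializing the mode explicitly would be another way to avoid the mismatch; the sketch only illustrates the empty-list idea raised above.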

