drill-issues mailing list archives

From "Jacques Nadeau (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-4279) Improve query plan when no column is required from SCAN
Date Thu, 28 Jan 2016 02:30:39 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15120623#comment-15120623 ]

Jacques Nadeau edited comment on DRILL-4279 at 1/28/16 2:30 AM:
----------------------------------------------------------------

Yes, I don't think the planner should try to figure these things out. The reader should be
responsible for implementing the skipAll semantics. Otherwise we limit the execution layer's
ability to implement things as efficiently as possible.

In terms of plan clarity, I support making the plan clearer (and in general think we need
to do this throughout the plan). How about we clean up the toString() method? My suggestion
would be to construct the plan string so that it makes sense to whoever reads it. For example,
there is no reason to even mention Scan or EasyGroupScan. So the plan would include one of the following:

Columns = [] (no columns needed -- tracked as skip_mode yes in the actual class)
Columns = [*] 
Columns = ['a', 'b', 'c']

or something along those lines.
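
A minimal sketch of how such a toString() helper could render the column list, written in Java. The class name ColumnListFormat and its format method are hypothetical and are not Drill APIs; the skipAll flag only stands in for the skip mode tracked in the actual class.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, NOT part of Drill: renders a scan's projected columns
// in the three forms suggested above.
final class ColumnListFormat {
    private ColumnListFormat() {}

    /**
     * @param skipAll true when the scan needs no columns at all (assumed flag)
     * @param columns projected column names; a single "*" means all columns
     */
    static String format(boolean skipAll, List<String> columns) {
        // No columns needed: show an empty list rather than a star plus a hidden mode.
        if (skipAll || columns.isEmpty()) {
            return "Columns = []";
        }
        // A lone star means every column is read.
        if (columns.size() == 1 && "*".equals(columns.get(0))) {
            return "Columns = [*]";
        }
        // Otherwise list each column name in single quotes.
        return columns.stream()
                .map(c -> "'" + c + "'")
                .collect(Collectors.joining(", ", "Columns = [", "]"));
    }
}
```

With this shape, a count(*) scan and a select * scan print differently even though both may carry a star column internally.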

So your example would become something much clearer, such as (assuming these are Parquet files):

{code}
explain plan for select count(*) from dfs.`/Users/jni/work/data/yelp/t1`;
00-03          Project($f0=[0])
00-04            ParquetScan(selectionRoot=file:/Users/jni/work/data/yelp/t1, numFiles=2, columns=[], files= ... 
{code}




was (Author: jnadeau):
Yes, I don't think the planner should try to figure these things out. The reader should be
responsible for implementing the skipAll semantic. Otherwise we limit the ability for the
execution layer to implement things most efficiently. 

In terms of plan clarity, I support making the plan clearer (and in general think we need
to do this throughout the plan). How about we clean up the toString() method of I don't think
it makes sense to have the planner know how a rowcount is created. My suggestion would be
that the plan string is constructed to make sense. For example, there is no reason to even
mention easy group scan. So the plan would be one of the following:

Columns = [] (no columns needed -- tracked as skip_mode yes in the actual class)
Columns = [*] 
Columns = ['a', 'b', 'c']

or something along those lines.

So your example is something much clearer such as (assuming this is Parquet files):

{code}
explain plan for select count(*) from dfs.`/Users/jni/work/data/yelp/t1`;
00-03          Project($f0=[0])
00-04            ParquetScan(selectionRoot=file:/Users/jni/work/data/yelp/t1, numFiles=2,
columns=[], files= ... 
{code}



> Improve query plan when no column is required from SCAN
> -------------------------------------------------------
>
>                 Key: DRILL-4279
>                 URL: https://issues.apache.org/jira/browse/DRILL-4279
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> When a query does not require any specific column to be returned from SCAN, for instance,
> {code}
> Q1:  select count(*) from T1;
> Q2:  select 1 + 100 from T1;
> Q3:  select  1.0 + random() from T1; 
> {code}
> Drill's planner would use a ColumnList with the * column, plus a SKIP_ALL mode. However, the MODE is not serialized / deserialized. This leads to two problems.
> 1) The EXPLAIN plan is confusing, since there is no way to differentiate a "SELECT * " query from this SKIP_ALL mode.
> For instance,
> {code}
> explain plan for select count(*) from dfs.`/Users/jni/work/data/yelp/t1`;
> 00-03          Project($f0=[0])
> 00-04            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/jni/work/data/yelp/t1, numFiles=2, columns=[`*`], files= ... 
> {code} 
> 2) If the query is executed in distributed / parallel mode, the missing serialization of the mode means that some Fragments fetch all the columns, while other Fragments skip all the columns. That will cause an execution error.
> For instance, after changing slice_target to force the query to be executed in multiple fragments, it hits an execution error.
> {code}
> select count(*) from dfs.`/Users/jni/work/data/yelp/t1`;
> org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: Error parsing JSON - You tried to start when you are using a ValueWriter of type NullableBitWriterImpl.
> {code}
> Directory "t1" contains just two yelp JSON files.
> Ideally, I think when no column is required from SCAN, the explain plan should show an empty column list. The MODE of SKIP_ALL together with the star * column seems to be confusing and error prone.
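
The serialization problem in the quoted description can be illustrated with a small Java sketch. Everything here is a simplified assumption for illustration: ScanSpec, its Mode enum (including the READ_COLUMNS name), and the comma-separated wire form are hypothetical and do not mirror Drill's actual classes or its plan serialization.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model, NOT Drill's classes: a column list plus a mode flag.
final class ScanSpec {
    enum Mode { READ_COLUMNS, SKIP_ALL }

    final List<String> columns;
    final Mode mode;

    ScanSpec(List<String> columns, Mode mode) {
        this.columns = columns;
        this.mode = mode;
    }

    // Lossy wire form: only the columns are written, the mode is dropped.
    String serializeWithoutMode() {
        return String.join(",", columns);
    }

    // A receiver of the lossy form can only fall back to the default mode.
    static ScanSpec deserializeWithoutMode(String wire) {
        return new ScanSpec(Arrays.asList(wire.split(",")), Mode.READ_COLUMNS);
    }

    // Alternative: encode "no columns needed" as an empty column list,
    // so nothing outside the column list has to survive the round trip.
    String serializeAsColumns() {
        return mode == Mode.SKIP_ALL ? "" : String.join(",", columns);
    }

    static ScanSpec deserializeFromColumns(String wire) {
        if (wire.isEmpty()) {
            return new ScanSpec(Collections.<String>emptyList(), Mode.SKIP_ALL);
        }
        return new ScanSpec(Arrays.asList(wire.split(",")), Mode.READ_COLUMNS);
    }
}
```

In this model, a spec planned as star plus SKIP_ALL comes back from the lossy round trip as a plain star scan, which is the mismatch between fragments described above; the empty-list encoding keeps the intent intact.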



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
