drill-issues mailing list archives

From "Jinfeng Ni (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-5464) Fix JSON reader when it deals with empty file
Date Fri, 04 Aug 2017 22:53:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16115064#comment-16115064 ]

Jinfeng Ni edited comment on DRILL-5464 at 8/4/17 10:52 PM:
------------------------------------------------------------

I ran the above query with the patch for DRILL-5546, the umbrella JIRA for schema change issues
related to NULL datasets.  The query finished successfully in multiple runs.

{code}
 select stars, count(*) as cnt from dfs.tmp.yelp group by stars;
+--------+---------+
| stars  |   cnt   |
+--------+---------+
| 2      | 102737  |
| 1      | 110772  |
| 4      | 342143  |
| 5      | 406045  |
| 3      | 163761  |
+--------+---------+
{code} 

Physical plan for the query:
{code}
00-00    Screen
00-01      Project(stars=[$0], cnt=[$1])
00-02        UnionExchange
01-01          HashAgg(group=[{0}], cnt=[$SUM0($1)])
01-02            Project(stars=[$0], cnt=[$1])
01-03              HashToRandomExchange(dist0=[[$0]])
02-01                UnorderedMuxExchange
03-01                  Project(stars=[$0], cnt=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)])
03-02                    HashAgg(group=[{0}], cnt=[COUNT()])
03-03                      Scan(groupscan=[EasyGroupScan [selectionRoot=file:/tmp/yelp, numFiles=2, columns=[`stars`], files=[file:/tmp/yelp/empty.json, file:/tmp/yelp/yelp_academic_dataset_review.json]]])
{code}
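
The plan above aggregates in two phases: a local HashAgg with COUNT() next to the scan (03-02), then a HashAgg with $SUM0 after the hash exchange (01-01). A minimal Python sketch of that shape follows. It is an illustration only, not Drill code; the function names and the sample rows are made up.

```python
from collections import Counter
from typing import Dict, Iterable, List


def partial_count(rows: Iterable[dict], key: str) -> Counter:
    """Phase 1 (03-02 in the plan): each scan fragment counts its own rows per key."""
    # row.get(key) groups rows without the key under None, similar to a NULL group.
    return Counter(row.get(key) for row in rows)


def final_sum(partials: List[Counter]) -> Dict[object, int]:
    """Phase 2 (01-01 in the plan): add up the partial counts for each key."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not overwrite
    return dict(total)


# Hypothetical inputs: one empty file and one file with a few rows.
empty_file: List[dict] = []
review_file = [{"stars": 5}, {"stars": 4}, {"stars": 5}]

result = final_sum([
    partial_count(empty_file, "stars"),
    partial_count(review_file, "stars"),
])
```

In this model the empty file contributes an empty partial result, so it has no effect on the final counts. That matches the expectation that adding empty.json should not change the query output.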



> Fix JSON reader when it deals with empty file
> ---------------------------------------------
>
>                 Key: DRILL-5464
>                 URL: https://issues.apache.org/jira/browse/DRILL-5464
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>
> An empty JSON file is one that contains no JSON objects.  If we query an empty JSON file
> and ask it to return column 'A', Drill's JSON record reader returns a batch with 0 rows
> and creates column 'A' as a nullable int column. A better name for such columns might be
> phantom columns, since the record reader has no knowledge of the column's schema and the
> nullable int type is just a guess.
> However, that processing can introduce many issues. Consider a directory consisting of
> multiple JSON files, at least one of which is empty.  If column 'A' is returned as a
> nullable int column from the reader over the empty file, while the other JSON files contain
> a real typed column 'A', the query can hit many issues, including 1) SchemaChangeException,
> 2) failure in certain operators that do not detect the schema change, or 3) incorrect query
> results, since the run-time code is generated over a phantom column type, not a real type.
> For instance, the following query against the yelp JSON file runs successfully.
> {code}
> select count(*), stars  from dfs.`/tmp/yelp/yelp_academic_dataset_review.json` group by stars;
> {code}
> If an empty JSON file is added to the directory, the query fails with the following
> error (which falls into the 2nd category: PartitionSender did not detect the schema change properly).
> {code}
> select count(*), stars  from dfs.`/tmp/yelp` group by stars;
> Error: SYSTEM ERROR: IllegalStateException: Failure while reading vector.  Expected vector class of org.apache.drill.exec.vector.NullableIntVector but was holding vector class org.apache.drill.exec.vector.NullableBigIntVector, field= stars(BIGINT:OPTIONAL)[$bits$(UINT1:REQUIRED), stars(BIGINT:OPTIONAL)]
> {code}
> {code}
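
The conflict described in the issue can be modeled in a few lines of Python. This is a toy sketch, not Drill's reader code: the type strings are borrowed from the error message, while `reader_schema`, `merge_schemas`, the `skip_phantom` flag, and the sample rows are all hypothetical. Having a reader stay silent about columns it never saw is one plausible way to avoid the conflict; it is not necessarily what the DRILL-5546 patch does.

```python
from typing import Dict, List, Optional

# Type a reader guesses for a column it never saw (the "phantom" type).
GUESSED_TYPE = "INT:OPTIONAL"


def reader_schema(rows: List[dict], column: str, skip_phantom: bool) -> Optional[str]:
    """Return the type one reader reports for `column`, or None if it reports nothing."""
    values = [row[column] for row in rows if column in row]
    if values:
        # Real data: JSON integers are modeled here as BIGINT, as in the error message.
        return "BIGINT:OPTIONAL"
    # No data at all: either guess a type (phantom column) or stay silent.
    return None if skip_phantom else GUESSED_TYPE


def merge_schemas(types: List[Optional[str]]) -> str:
    """Combine per-file types; raise if two readers disagree."""
    seen = {t for t in types if t is not None}
    if len(seen) > 1:
        raise ValueError(f"schema change: {sorted(seen)}")
    # If no reader saw the column at all, fall back to the guessed type.
    return seen.pop() if seen else GUESSED_TYPE


# Hypothetical directory contents: one empty file, one file with real data.
files: Dict[str, List[dict]] = {
    "empty.json": [],
    "reviews.json": [{"stars": 5}, {"stars": 3}],
}


def query_type(skip_phantom: bool) -> str:
    """Type of column `stars` for a query over every file in the directory."""
    return merge_schemas(
        [reader_schema(rows, "stars", skip_phantom) for rows in files.values()]
    )
```

With `skip_phantom=False` the empty file reports a guessed nullable int while the data file reports a nullable bigint, so `query_type` raises, mirroring the failure above. With `skip_phantom=True` only the real type survives the merge.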



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
