Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 87D3E200CE9 for ; Sat, 5 Aug 2017 00:53:06 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 866CA16E64E; Fri, 4 Aug 2017 22:53:06 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id C76F516E64D for ; Sat, 5 Aug 2017 00:53:05 +0200 (CEST) Received: (qmail 24017 invoked by uid 500); 4 Aug 2017 22:53:05 -0000 Mailing-List: contact issues-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@drill.apache.org Delivered-To: mailing list issues@drill.apache.org Received: (qmail 24008 invoked by uid 99); 4 Aug 2017 22:53:05 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 04 Aug 2017 22:53:05 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 7D8F3180418 for ; Fri, 4 Aug 2017 22:53:04 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.202 X-Spam-Level: X-Spam-Status: No, score=-99.202 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id now8aCQvSgQ0 for ; Fri, 4 Aug 2017 22:53:02 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id 912C95FB6A for ; Fri, 4 Aug 2017 22:53:01 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id C83E1E09A6 for ; Fri, 4 Aug 2017 22:53:00 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 27E822464D for ; Fri, 4 Aug 2017 22:53:00 +0000 (UTC) Date: Fri, 4 Aug 2017 22:53:00 +0000 (UTC) From: "Jinfeng Ni (JIRA)" To: issues@drill.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Comment Edited] (DRILL-5464) Fix JSON reader when it deals with empty file MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Fri, 04 Aug 2017 22:53:06 -0000 [ https://issues.apache.org/jira/browse/DRILL-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16115064#comment-16115064 ] Jinfeng Ni edited comment on DRILL-5464 at 8/4/17 10:52 PM: ------------------------------------------------------------ Run the above query with the patch for DRILL-5546, the umbrella jira for schema change issues related to NULL dataset. The query was finished successfully in multiple runs. {code} select stars, count(*) as cnt from dfs.tmp.yelp group by stars; +--------+---------+ | stars | cnt | +--------+---------+ | 2 | 102737 | | 1 | 110772 | | 4 | 342143 | | 5 | 406045 | | 3 | 163761 | +--------+---------+ {code} Physical plan for the query; {code} 00-00 Screen 00-01 Project(stars=[$0], cnt=[$1]) 00-02 UnionExchange 01-01 HashAgg(group=[{0}], cnt=[$SUM0($1)]) 01-02 Project(stars=[$0], cnt=[$1]) 01-03 HashToRandomExchange(dist0=[[$0]]) 02-01 UnorderedMuxExchange 03-01 Project(stars=[$0], cnt=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)]) 03-02 HashAgg(group=[{0}], cnt=[COUNT()]) 03-03 Scan(groupscan=[EasyGroupScan [selectionRoot=file:/tmp/yelp, numFiles=2, columns=[`stars`], files=[file:/tmp/yelp/empty.json, file:/tmp/yelp/yelp_academic_dataset_review.json]]]) {code} was (Author: jni): Run the above query with the patch for DRILL-5546, the umbrella jira for schema change issues related to NULL dataset. The query was finished successfully. {code} select stars, count(*) as cnt from dfs.tmp.yelp group by stars; +--------+---------+ | stars | cnt | +--------+---------+ | 2 | 102737 | | 1 | 110772 | | 4 | 342143 | | 5 | 406045 | | 3 | 163761 | +--------+---------+ {code} Physical plan for the query; {code} 00-00 Screen 00-01 Project(stars=[$0], cnt=[$1]) 00-02 UnionExchange 01-01 HashAgg(group=[{0}], cnt=[$SUM0($1)]) 01-02 Project(stars=[$0], cnt=[$1]) 01-03 HashToRandomExchange(dist0=[[$0]]) 02-01 UnorderedMuxExchange 03-01 Project(stars=[$0], cnt=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)]) 03-02 HashAgg(group=[{0}], cnt=[COUNT()]) 03-03 Scan(groupscan=[EasyGroupScan [selectionRoot=file:/tmp/yelp, numFiles=2, columns=[`stars`], files=[file:/tmp/yelp/empty.json, file:/tmp/yelp/yelp_academic_dataset_review.json]]]) {code} > Fix JSON reader when it deals with empty file > --------------------------------------------- > > Key: DRILL-5464 > URL: https://issues.apache.org/jira/browse/DRILL-5464 > Project: Apache Drill > Issue Type: Bug > Reporter: Jinfeng Ni > > An empty json file is the one without any json object. If we query an empty json file asking it to return column 'A', Drill's JSON record reader would return a batch with 0 row, and put column 'A' as a nullable int column. A better name for such column might be phantom columns, as the record reader does not have any knowledge of the column schema, and the nullable int column is just a guessed schema. > However, that processing could introduce many issues. Consider if we have a directory consisting of multiple json files and at least one of them is empty. If column 'A' is returned as nullable-int column from the reader over the empty file, while the other json files contains a real typed column 'A', that would cause query hit many issues, including 1) SchemaChangeException, 2) failed in certain operator which does not detect SchemaChange, 3) or incorrect query result, since the run-time code is generated over a phantom column type, not a real type. > For instance, the following query against yelp json file run successfully. > {code} > select count(*), stars from dfs.`/tmp/yelp/yelp_academic_dataset_review.json` group by stars; > {code} > If an empty json file is added to the directory, the query would fail with the following error (which falls into the 2nd category : PartitionSender did not detect schema change properly). > {code} > select count(*), stars from dfs.`/tmp/yelp` group by stars; > Error: SYSTEM ERROR: IllegalStateException: Failure while reading vector. Expected vector class of org.apache.drill.exec.vector.NullableIntVector but was holding vector class org.apache.drill.exec.vector.NullableBigIntVector, field= stars(BIGINT:OPTIONAL)[$bits$(UINT1:REQUIRED), stars(BIGINT:OPTIONAL)] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)