Date: Fri, 24 Apr 2015 17:34:39 +0000 (UTC)
From: "Rahul Challapalli (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Updated] (DRILL-2869) Incorrect data when we have fields missing in some of the files - another case

     [ https://issues.apache.org/jira/browse/DRILL-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rahul Challapalli updated DRILL-2869:
-------------------------------------
    Description:
git.commit.id.abbrev=5cd36c5

Data File1 : a.json
{code}
{ "c1" : 1, "m1" : {"m2" : {"m3" : {"c2" : 5} } } }
{ "c1" : 2, "m1" : {"m2" : {"m3" : {"c2" : 6} } } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{code}

Data File2 : b.json
{code}
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{code}

Data File3 : c.json
{code}
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{code}

The below query reports incorrect results for both json and parquet formats. It returns empty maps when it should not. This issue is even present when we query equivalent parquet files
{code}
select t.m1.m2 from `delme_repro` as `t`;
+------------+
| EXPR$0     |
+------------+
| {"c2":5}   |
| {"c2":5}   |
| {"c2":5}   |
| {"c2":5}   |
| {"c2":5}   |
| {"c2":5}   |
| {}         |
| {}         |
| {"c2":5}   |
+------------+
{code}

However if I run the same query on the specific file, I get the correct output
{code}
select t.m1.m2 from `delme_repro/a.json` as `t`;
+------------+
| EXPR$0     |
+------------+
| {"m3":{"c2":5}} |
| {"m3":{"c2":6}} |
| {"m3":{},"c2":5} |
+------------+
3 rows selected (0.113 seconds)
{code}

Let me know if you have any questions

  was:
git.commit.id.abbrev=5cd36c5

Data File1 : a.json
{code}
{ "c1" : 1, "m1" : {"m2" : {"m3" : {"c2" : 5} } } }
{ "c1" : 2, "m1" : {"m2" : {"m3" : {"c2" : 6} } } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{code}

Data File2 : b.json
{code}
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{code}

Data File3 : c.json
{code}
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{ "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
{code}

The below query reports incorrect data :
{code}
select t.m1.m2.m3 from `delme_repro` as `t`;
+------------+
| EXPR$0     |
+------------+
| null       |
| null       |
| null       |
| null       |
| null       |
| null       |
| null       |
| null       |
| null       |
+------------+
9 rows selected (0.139 seconds)
{code}

However if I run the same query on the specific file, I get the correct output
{code}
select t.m1.m2.m3 from `delme_repro/a.json` as `t`;
+------------+
| EXPR$0     |
+------------+
| {"c2":5}   |
| {"c2":6}   |
| {}         |
+------------+
3 rows selected (0.113 seconds)
{code}

It looks like the file size plays a part in deciding the order in which Drill reads the files.
But there could be more to this than just the order, because when I made sure that 'b.json' and 'c.json' each had only one record, Drill correctly reported the data.

Let me know if you have any questions


> Incorrect data when we have fields missing in some of the files - another case
> ------------------------------------------------------------------------------
>
>                 Key: DRILL-2869
>                 URL: https://issues.apache.org/jira/browse/DRILL-2869
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators, Storage - JSON, Storage - Parquet
>            Reporter: Rahul Challapalli
>            Assignee: Hanifi Gunes
>            Priority: Critical
>
> git.commit.id.abbrev=5cd36c5
> Data File1 : a.json
> {code}
> { "c1" : 1, "m1" : {"m2" : {"m3" : {"c2" : 5} } } }
> { "c1" : 2, "m1" : {"m2" : {"m3" : {"c2" : 6} } } }
> { "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
> {code}
> Data File2 : b.json
> {code}
> { "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
> { "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
> { "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
> {code}
> Data File3 : c.json
> {code}
> { "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
> { "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
> { "c1" : 3, "m1" : {"m2" : {"c2" : 5} } }
> {code}
> The below query reports incorrect results for both json and parquet formats. It returns empty maps when it should not.
> This issue is even present when we query equivalent parquet files
> {code}
> select t.m1.m2 from `delme_repro` as `t`;
> +------------+
> | EXPR$0     |
> +------------+
> | {"c2":5}   |
> | {"c2":5}   |
> | {"c2":5}   |
> | {"c2":5}   |
> | {"c2":5}   |
> | {"c2":5}   |
> | {}         |
> | {}         |
> | {"c2":5}   |
> +------------+
> {code}
> However if I run the same query on the specific file, I get the correct output
> {code}
> select t.m1.m2 from `delme_repro/a.json` as `t`;
> +------------+
> | EXPR$0     |
> +------------+
> | {"m3":{"c2":5}} |
> | {"m3":{"c2":6}} |
> | {"m3":{},"c2":5} |
> +------------+
> 3 rows selected (0.113 seconds)
> {code}
> Let me know if you have any questions

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
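
For anyone trying to reproduce this, the sketch below recreates the three data files from the description under a `delme_repro` directory and prints what `m1.m2` contains in each record when read with plain dictionary lookups. It is not part of the original report. The helper names `write_repro` and `plain_m1_m2` and the use of a temporary directory are assumptions made for illustration, and the lookup is ordinary per-record JSON access, not Drill's schema handling, so it only serves as a baseline to compare the directory-level query output against.

```python
import json
import tempfile
from pathlib import Path

# Records copied from the report: a.json, b.json and c.json.
FLAT = {"c1": 3, "m1": {"m2": {"c2": 5}}}
FILES = {
    "a.json": [
        {"c1": 1, "m1": {"m2": {"m3": {"c2": 5}}}},
        {"c1": 2, "m1": {"m2": {"m3": {"c2": 6}}}},
        FLAT,
    ],
    "b.json": [FLAT] * 3,
    "c.json": [FLAT] * 3,
}


def write_repro(root: Path) -> Path:
    """Write the three files, one JSON object per line, under root/delme_repro."""
    target = root / "delme_repro"
    target.mkdir(parents=True, exist_ok=True)
    for name, records in FILES.items():
        lines = [json.dumps(record) for record in records]
        (target / name).write_text("\n".join(lines) + "\n")
    return target


def plain_m1_m2(directory: Path) -> list:
    """Read every record back and pick m1.m2 with ordinary dict lookups.

    This is plain per-record JSON access, not Drill's schema-merging logic.
    """
    values = []
    for path in sorted(directory.glob("*.json")):
        for line in path.read_text().splitlines():
            values.append(json.loads(line)["m1"]["m2"])
    return values


if __name__ == "__main__":
    repro_dir = write_repro(Path(tempfile.mkdtemp()))
    for value in plain_m1_m2(repro_dir):
        print(json.dumps(value))
```

None of the nine values read back this way is an empty map, which is the point of the report: the `{}` rows only appear when the whole directory is queried. To run the queries from the description, the generated `delme_repro` directory would need to be copied into whichever workspace the query is issued against; that location is not stated in the report.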