drill-issues mailing list archives

From "Khurram Faraaz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing
Date Wed, 02 Nov 2016 19:03:58 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630075#comment-15630075 ]

Khurram Faraaz commented on DRILL-4653:
---------------------------------------

I don't think this is fixed; there are still some cases that need to be taken care of. Please
see below.
More importantly, checking for malformed JSON should be ON/enabled by default in Drill. Users
would rather ignore bad records than see an Exception/Error and then have our support suggest
that they enable store.json.reader.skip_invalid_records.

[test@cent01 drill_4653]# cat badjson_01.json
{"key":"test string"}
{"key":"foo"}
{"key":"foobar"
{"key":"blah"}
{"key":"temp"}

[test@cent01 drill_4653]# cat badjson_02.json
{
    "key":"foo",
    "badarray":[1,3,4,5,6,7,8,,
    "key":"test string",
    "key":"foobar"
}
[test@cent01 drill_4653]#

[test@cent01 drill_4653]# cat badjson_03.json
{
    "key":"foo",
    "key":"foobar",
    "key":"test string",
    "key":"string",
    "key":
}
[test@cent01 drill_4653]#

[test@cent01 drill_4653]# cat badjson_04.json
{"key":1}
{"key":2}
{"key":3}
{"key":
[test@cent01 drill_4653]

[test@cent01 drill_4653]# cat badjson_05.json
{
    "key1":"foobar",
    "key2":[1,3,4,5,6,7,8,9],
    "key3":{ "key4":},
    "key5":"foo"
}
[test@cent01 drill_4653]

[test@cent01 drill_4653]# cat badjson_06.json
{
    "name":"John Doe",
    "age":33,
    "dept":"IT",
    "address":{
                  "street":"some street",
                  "city":"some city",
                  "zip":
              }
    "isManager":"yes"
}
[test@cent01 drill_4653]

{noformat}
0: jdbc:drill:schema=dfs.tmp> alter session set `store.json.reader.skip_invalid_records`=true;
+-------+--------------------------------------------------+
|  ok   |                     summary                      |
+-------+--------------------------------------------------+
| true  | store.json.reader.skip_invalid_records updated.  |
+-------+--------------------------------------------------+
1 row selected (0.334 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_01.json`;
+--------------+
|     key      |
+--------------+
| test string  |
| foo          |
| temp         |
+--------------+
3 rows selected (0.466 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_01.json`;
+--------------+
|     key      |
+--------------+
| test string  |
| foo          |
| temp         |
+--------------+
3 rows selected (0.222 seconds)
{noformat}
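
For comparison, here is a rough sketch of what a purely line-oriented reader would do with badjson_01.json. This is only an illustration in Python and not how Drill's JsonReader is implemented. The function name read_valid_records is made up for this example. It assumes one JSON record per line and skips any line that fails to parse.

```python
import json

def read_valid_records(lines):
    """Yield the parsed value of every line that is valid JSON; skip the rest."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Malformed record: skip it and keep reading the next line.
            continue
        yield record

# Contents of badjson_01.json as listed earlier in this comment.
badjson_01 = [
    '{"key":"test string"}',
    '{"key":"foo"}',
    '{"key":"foobar"',
    '{"key":"blah"}',
    '{"key":"temp"}',
]

keys = [record["key"] for record in read_valid_records(badjson_01)]
print(keys)  # ['test string', 'foo', 'blah', 'temp']
```

Under that assumption only the truncated "foobar" line is dropped and "blah" is kept, whereas the two queries above return three rows without "blah".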

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_02.json`;
Error: DATA_READ ERROR: Unexpected character (',' (code 44)): expected a valid value (number,
String, array, object, 'true', 'false' or 'null')
 at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@3e4c2712; line: 3, column:
32]

Line  3
Column  33
Field  badarray
Fragment 0:0

[Error Id: 6da211b5-a287-4239-82b4-26a35e47ed10 on centos-01.qa.lab:31010] (state=,code=0)
{noformat}

Stack trace from drillbit.log for above failure
{noformat}
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: Unexpected character (','
(code 44)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@3e4c2712; line: 3, column:
32]

Line  3
Column  33
Field  badarray

[Error Id: 6da211b5-a287-4239-82b4-26a35e47ed10 ]
        at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:543)
~[drill-common-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.vector.complex.fn.JsonReader.writeData(JsonReader.java:586)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.vector.complex.fn.JsonReader.writeData(JsonReader.java:372)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.vector.complex.fn.JsonReader.writeDataSwitch(JsonReader.java:306)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.vector.complex.fn.JsonReader.writeToVector(JsonReader.java:247)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.vector.complex.fn.JsonReader.write(JsonReader.java:202) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.store.easy.json.JSONRecordReader.next(JSONRecordReader.java:206)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.physical.impl.ScanBatch.next(ScanBatch.java:178) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:135)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:81)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:94) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:232)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:226)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at java.security.AccessController.doPrivileged(Native Method) [na:1.8.0_91]
        at javax.security.auth.Subject.doAs(Subject.java:422) [na:1.8.0_91]
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
[hadoop-common-2.7.0-mapr-1607.jar:na]
        at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:226)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
[drill-common-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[na:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[na:1.8.0_91]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character (',' (code
44)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@3e4c2712; line: 3, column:
32]
        at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1586) ~[jackson-core-2.7.1.jar:2.7.1]
        at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:521)
~[jackson-core-2.7.1.jar:2.7.1]
        at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:450)
~[jackson-core-2.7.1.jar:2.7.1]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleUnexpectedValue(UTF8StreamJsonParser.java:2628)
~[jackson-core-2.7.1.jar:2.7.1]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:854)
~[jackson-core-2.7.1.jar:2.7.1]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:748)
~[jackson-core-2.7.1.jar:2.7.1]
        at org.apache.drill.exec.vector.complex.fn.JsonReader.writeData(JsonReader.java:537)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        ... 24 common frames omitted
{noformat}


{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_02.json`;
+------+
| key  |
+------+
+------+
No rows selected (0.477 seconds)
{noformat}

This query should return "foo", "foobar", "test string", "string" in 4 rows.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_03.json`;
+------+
| key  |
+------+
+------+
No rows selected (0.208 seconds)
{noformat}

This query should return "foobar".
{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_03.json` where key ='foobar';
+------+
| key  |
+------+
+------+
No rows selected (0.253 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_04.json`;
+------+
| key  |
+------+
| 1    |
| 2    |
| 3    |
+------+
3 rows selected (0.232 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_04.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected end-of-input within/between OBJECT
entries

File  /tmp/badjson_04.json
Record  4
Column  39
Fragment 0:0

[Error Id: a30668ff-8bdc-44bc-aeac-c566e2f731b6 on centos-01.qa.lab:31010] (state=,code=0)
{noformat}

Stack trace from drillbit.log
{noformat}
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input within/between
OBJECT entries
 at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@37039ebe; line: 5, column:
39]
        at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1586) ~[jackson-core-2.7.1.jar:2.7.1]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipColon2(UTF8StreamJsonParser.java:3038)
~[jackson-core-2.7.1.jar:2.7.1]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipColon(UTF8StreamJsonParser.java:2950)
~[jackson-core-2.7.1.jar:2.7.1]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:756)
~[jackson-core-2.7.1.jar:2.7.1]
        at org.apache.drill.exec.vector.complex.fn.JsonReader.writeData(JsonReader.java:350)
~[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.vector.complex.fn.JsonReader.writeDataSwitch(JsonReader.java:306)
~[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.vector.complex.fn.JsonReader.writeToVector(JsonReader.java:247)
~[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.vector.complex.fn.JsonReader.write(JsonReader.java:202) ~[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        at org.apache.drill.exec.store.easy.json.JSONRecordReader.next(JSONRecordReader.java:206)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
        ... 19 common frames omitted
{noformat}

This query should return "foobar" in key1 and the array [1,3,4,5,6,7,8,9] in key2.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_05.json`;
+-------+-------+
| key1  | key2  |
+-------+-------+
+-------+-------+
No rows selected (0.229 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp>  select key1 from `badjson_05.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('}' (code 125)): expected
a value

File  /tmp/badjson_05.json
Record  1
Column  22
Fragment 0:0

[Error Id: 01a8ce3b-b0c0-41c5-92cd-3467265b60a6 on centos-01.qa.lab:31010] (state=,code=0)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp>  select key2 from `badjson_05.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('}' (code 125)): expected
a value

File  /tmp/badjson_05.json
Record  1
Column  22
Fragment 0:0

[Error Id: 40bb646b-18e7-4dff-812d-f409ea1fcf27 on centos-01.qa.lab:31010] (state=,code=0)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_06.json`;
+-------+------+-------+----------+
| name  | age  | dept  | address  |
+-------+------+-------+----------+
+-------+------+-------+----------+
No rows selected (0.205 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select name from `badjson_06.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('}' (code 125)): expected
a value

File  /tmp/badjson_06.json
Record  1
Column  16
Fragment 0:0

[Error Id: b549023e-1f54-418c-adc5-9a21cf0ec3aa on centos-01.qa.lab:31010] (state=,code=0)
{noformat}


> Malformed JSON should not stop the entire query from progressing
> ----------------------------------------------------------------
>
>                 Key: DRILL-4653
>                 URL: https://issues.apache.org/jira/browse/DRILL-4653
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - JSON
>    Affects Versions: 1.6.0
>            Reporter: subbu srinivasan
>             Fix For: 1.9.0
>
>
> Currently a Drill query terminates on first encountering an invalid JSON line.
> Drill should continue progressing after ignoring the bad records. Something
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
