drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jacques Nadeau (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4520) Error parsing JSON ( a column with different datatypes )
Date Fri, 25 Mar 2016 16:52:25 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212064#comment-15212064
] 

Jacques Nadeau commented on DRILL-4520:
---------------------------------------

I believe the second problem you are having (once enabling union type) is that you have mixed
data in one of your columns and currently, Parquet doesn't support mixed schema natively and
the write fails. Can you enable 
{code}
SET `exec.errors.verbose` = true
{code}

And share the complete error message so we can confirm?

If this is the case, you can probably use filters to only get the data that has consistent
schema.

Note that this is probably why you have success with Hive: it simply deletes records with
inconsistent schema.

> Error parsing JSON ( a column with different datatypes )
> --------------------------------------------------------
>
>                 Key: DRILL-4520
>                 URL: https://issues.apache.org/jira/browse/DRILL-4520
>             Project: Apache Drill
>          Issue Type: Test
>            Reporter: Shankar
>
> I am stuck in the middle of somewhere. Could you please help me to resolve below error.
> I am running query on drill 1.6.0 in cluster on logs json data (150GB size of log file)
( 1 json / line).
> {quote}
> solution as per my opinion - 
> 1. Either drill should able to ignore those lines(ANY data type) while reading or creating
the table (CTAS).
> 2. Or Data will get stored as it is with ANY data type if any fields in data differs
in their data types. This will be useful in the case where other columns (excluding ANY data
type columns) carrying important informations.
> {quote}
> h4. -------------- test.json --------------
> Abount Data : 
> 1. I have just extract 3 lines from logs for test purpose.
> 2. In data field called "ajaxUrl" is differ in datatype. Sometimes it contains string
and sometime array of jsons and null as well. 
> 3. Here in our case - Some events in 150 gb json file are like this where they differ
in structure. I could say there are only 0.1% (per 150gb json file) are such events.
> {noformat}
> {"ajaxData":null,"metadata":null,"ajaxUrl":"/player/updatebonus1","selectedItem":null,"sessionid":"BC497C7C39B3C90AC9E6E9E8194C3","timestamp":1457658600032}
> {"gameId":"https://daemon2.com/tournDetails.do?type=myGames&userId=1556148&jsonp_callback=jQuery213043","ajaxData":null,"metadata":null,"ajaxUrl":[{"R":0,"rNo":1,"gid":4,"wal":0,"d":{"gid":4,"pt":3,"wc":2326,"top":"1","reg":true,"brkt":1457771400268,"sk":"25070010105301000009","id":56312439,"a":0,"st":1457771400000,"e":"0.0","j":0,"n":"Loot
Qualifier 1","tc":94,"et":0,"syst":1457771456,"rc":145770000,"s":5,"t":10000,"tk":false,"prnId":56311896,"jc":10000,"tp":"100000.0","ro":145400000,"rp":0,"isprn":false},"fl":"192.168.35.42","aaid":"5828"}],"selectedItem":null,"sessionid":"D18104E8CA3071C7A8F4E141B127","timestamp":1457771458873}
> {"ajaxData":null,"metadata":null,"ajaxUrl":"/player/updatebonus2","selectedItem":null,"sessionid":"BC497C7C39B3C90AC9E6E9E8194C3","timestamp":1457958600032}
> {noformat}
> h4. -------------- Select Query  (ERROR) --------------
> {noformat}
> select
> `timestamp`,
> sessionid,
> gameid,
> ajaxUrl,
> ajaxData
> from dfs.`/tmp/test.json` t
> ;
> {noformat}
> {color:red}
> Error: DATA_READ ERROR: Error parsing JSON - You tried to start when you are using a
ValueWriter of type NullableVarCharWriterImpl.
> File  /tmp/test.json
> Record  2
> Fragment 0:0
> {color}
> h4. -------------- Select Query (works Fine with UNION type) --------------
> Tried UNION type (an experimental feature)
> set `exec.enable_union_type` = true;
> {noformat}
> set `exec.enable_union_type` = true;
> +-------+----------------------------------+
> |  ok   |             summary              |
> +-------+----------------------------------+
> | true  | exec.enable_union_type updated.  |
> +-------+----------------------------------+
> 1 row selected (0.193 seconds)
> select
> `timestamp`,
> sessionid,
> gameid,
> ajaxUrl,
> ajaxData
> from dfs.`/tmp/test.json` t
> ;
> +----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
> |   timestamp    |           sessionid            |                                 
          gameid                                            |        ajaxUrl        | ajaxData
 |
> +----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
> | 1457658600032  | BC497C7C39B3C90AC9E6E9E8194C3  | null                            
                                                            | /player/updatebonus1  | null
     |
> | 1457771458873  | D18104E8CA3071C7A8F4E141B127   | https://daemon2.com/tournDetails.do?type=myGames&userId=1556148&jsonp_callback=jQuery213043
 | []                    | null      |
> | 1457958600032  | BC497C7C39B3C90AC9E6E9E8194C3  | null                            
                                                            | /player/updatebonus2  | null
     |
> +----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
> 3 rows selected (0.965 seconds)
> {noformat}
> h4. -------------- CTAS Query (ERROR) --------------
> {noformat}
> set `exec.enable_union_type` = true;
> +-------+----------------------------------+
> |  ok   |             summary              |
> +-------+----------------------------------+
> | true  | exec.enable_union_type updated.  |
> +-------+----------------------------------+
> 1 row selected (0.193 seconds)
> create table dfs.tmp.test1 AS 
> select
> `timestamp`,
> sessionid,
> gameid,
> ajaxUrl,
> ajaxData
> from dfs.`/tmp/test.json` t
> ;
> {noformat}
> {color:red}
> Error: SYSTEM ERROR: NullPointerException
> Fragment 0:0
> {color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message