drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Abdel Hakim Deneche <adene...@maprtech.com>
Subject Re: [DISCUSS] Querying nested data from JSON with and without json_all_text_mode results in errors
Date Thu, 10 Sep 2015 22:30:24 GMT
It's interesting that the following query also fails:

SELECT COUNT(meta) FROM `rows.json`

the schema change is nested inside meta.view, so even though we are
counting how many meta "groups" we have, Drill still reads and materializes
all the content of the meta field.



On Thu, Sep 10, 2015 at 2:16 PM, Jason Altekruse <altekrusejason@gmail.com>
wrote:

> This is a known issue, we have been describing this schema change scenario
> as a change in "data shape". Here is a very simple dataset that will
> produce the same error
>
> { "a" : 1 }
> { "a" : { "b" : 3} }
>
> We currently only use all_text_mode to change the type of the data at a
> leaf in the schema.
>
> There is an argument that we could have implemented all_text_mode to read
> this file and transform this data into this representation, where complex
> data is stored in text blobs as JSON:
>
> { "a" : "1" }
> { "a" : "{\"b\" : 3}" }
>
> Unfortunately, if this came out of the read, we would need to force users
> to use the covert_from(col, 'JSON') function to access the fields that are
> nested below the root level of the schema.
>
> We decided not to do this at the time all_text_mode was implemented. I do
> not believe we want to provide a workaround this right now as the work on
> embedded type, which will enable native support for all kinds of schema
> changes is ongoing.
>
> https://issues.apache.org/jira/browse/DRILL-3228
>
>
>
>
> On Thu, Sep 10, 2015 at 11:46 AM, Khurram Faraaz <kfaraaz@maprtech.com>
> wrote:
>
> > Hi,
> >
> >
> > Querying nested data from a JSON file with and with out setting
> > store.json.all_text_mode
> > results in errors.
> >
> >
> > Data that was used in the test is available here -
> >
> >
> https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5
> >
> > There is an option (on the top right) to Export as JSON.
> >
> > Drill commit ID : 0686bc23 (on master)
> >
> >
> > [root@centos-01 ~]# hadoop fs -ls /tmp/rows.json
> >
> > -rwxr-xr-x   3 root root  156417523 2015-09-10 17:18 /tmp/rows.json
> >
> >
> > [root@centos-01 food_inspection_data]# hadoop fs -put rows.json /tmp
> >
> > [root@centos-01 food_inspection_data]# cd
> /opt/mapr/drill/drill-1.2.0/bin
> >
> > [root@centos-01 bin]# ./sqlline -u "jdbc:drill:schema=dfs.tmp"
> >
> > apache drill 1.0.0
> >
> > "a little sql for your nosql"
> >
> >
> > Query 1)
> >
> >
> > 0: jdbc:drill:schema=dfs.tmp> select * from `rows.json`;
> >
> > Error: DATA_READ ERROR: You tried to start when you are using a
> ValueWriter
> > of type NullableVarCharWriterImpl.
> >
> >
> > File  /tmp/rows.json
> >
> > Record  1
> >
> > Line  1023
> >
> > Column  29
> >
> > Field  smallest
> >
> > Fragment 0:0
> >
> >
> > [Error Id: 0844933a-ed61-4137-9970-203576d5770d on
> centos-02.qa.lab:31010]
> > (state=,code=0)
> >
> >
> > *-----*
> >
> >
> > Query 2) Drill seems to be checking for correctness of data much deeper
> in
> > the nesting. Note that I do a count over field* meta, *and field*
> > "smallest" *reported in error is much deeper inside* meta, *
> > meta.view.columns.cachedContents.smallest
> >
> >
> > 0: jdbc:drill:schema=dfs.tmp> select count(meta) from `rows.json`;
> >
> > Error: DATA_READ ERROR: You tried to start when you are using a
> ValueWriter
> > of type NullableVarCharWriterImpl.
> >
> >
> > File  /tmp/rows.json
> >
> > Record  1
> >
> > Line  1023
> >
> > Column  29
> >
> > Field  smallest
> >
> > Fragment 0:0
> >
> >
> > [Error Id: 66e7f461-6671-4862-bde5-685c8c2304c9 on
> centos-04.qa.lab:31010]
> > (state=,code=0)
> >
> >
> > Explain plan for Query (2)
> >
> >
> > 0: jdbc:drill:schema=dfs.tmp> explain plan for select count(meta) from
> > `rows.json`;
> >
> > *+------+------+*
> >
> > *| **text** | **json** |*
> >
> > *+------+------+*
> >
> > *| *00-00    Screen
> >
> > 00-01      Project(EXPR$0=[$0])
> >
> > 00-02        StreamAgg(group=[{}], EXPR$0=[$SUM0($0)])
> >
> > 00-03          StreamAgg(group=[{}], EXPR$0=[COUNT($0)])
> >
> > 00-04            Scan(groupscan=[EasyGroupScan
> > [selectionRoot=maprfs:/tmp/rows.json, numFiles=1, columns=[`meta`],
> > files=[maprfs:///tmp/rows.json]]])
> >
> >
> > Query 3)
> >
> >
> > 0: jdbc:drill:schema=dfs.tmp> select count(t.meta.view) from `rows.json`
> t;
> >
> > Error: DATA_READ ERROR: You tried to start when you are using a
> ValueWriter
> > of type NullableVarCharWriterImpl.
> >
> >
> > File  /tmp/rows.json
> >
> > Record  1
> >
> > Line  1023
> >
> > Column  29
> >
> > Field  smallest
> >
> > Fragment 0:0
> >
> >
> > [Error Id: 55ea650c-9587-42d8-aec9-0f5a4b77ee4c on
> centos-04.qa.lab:31010]
> > (state=,code=0)
> >
> >
> > Query 4) - The count returned is wrong!
> >
> >
> > 0: jdbc:drill:schema=dfs.tmp> select count(t.meta.view.id) from
> > `rows.json`
> > t;
> >
> > *+---------+*
> >
> > *| **EXPR$0 ** |*
> >
> > *+---------+*
> >
> > *| *1      * |*
> >
> > *+---------+*
> >
> > 1 row selected (1.57 seconds)
> >
> > -------------------
> >
> > The below queries were run after json all text mode configuration was
> > enabled.
> >
> > 0: jdbc:drill:schema=dfs.tmp> alter session set
> > `store.json.all_text_mode`=true;
> >
> > *+-------+------------------------------------+*
> >
> > *| ** ok  ** | **             summary              ** |*
> >
> > *+-------+------------------------------------+*
> >
> > *| *true * | *store.json.all_text_mode updated. * |*
> >
> > *+-------+------------------------------------+*
> >
> > 1 row selected (0.218 seconds)
> >
> >
> > 0: jdbc:drill:schema=dfs.tmp> select count(t.meta.view) from `rows.json`
> t;
> >
> > Error: DATA_READ ERROR: Error parsing JSON - You tried to start when you
> > are using a ValueWriter of type NullableVarCharWriterImpl.
> >
> >
> > File  /tmp/rows.json
> >
> > Record  1
> >
> > Fragment 0:0
> >
> >
> > [Error Id: 5eeee0f3-69fd-445c-a149-a3aa2f18d1cb on
> centos-04.qa.lab:31010]
> > (state=,code=0)
> >
> >
> > COUNT returned in below query is wrong.
> >
> >
> > 0: jdbc:drill:schema=dfs.tmp> select count(t.meta.view.id) from
> > `rows.json`
> > t;
> >
> > +---------+
> >
> > | EXPR$0  |
> >
> > +---------+
> >
> > | 1       |
> >
> > +---------+
> >
> > 1 row selected (0.948 seconds)
> >
> >
> > Query to perform count over nested data.
> >
> >
> > 0: jdbc:drill:schema=dfs.tmp> select
> > count(t.meta.view.columns.cachedContents.smallest) from `rows.json` t;
> >
> > Error: DATA_READ ERROR: Error parsing JSON - You tried to start when you
> > are using a ValueWriter of type NullableVarCharWriterImpl.
> >
> >
> > File  /tmp/rows.json
> >
> > Record  1
> >
> > Fragment 0:0
> >
> >
> > [Error Id: 4ce0d5fe-abd7-47d9-8106-269fe37409e6 on
> centos-04.qa.lab:31010]
> > (state=,code=0)
> >
> >
> > 0: jdbc:drill:schema=dfs.tmp> select
> > t.meta.view.columns.cachedContents.smallest from `rows.json` t;
> >
> > Error: DATA_READ ERROR: Error parsing JSON - You tried to start when you
> > are using a ValueWriter of type NullableVarCharWriterImpl.
> >
> >
> > File  /tmp/rows.json
> >
> > Record  1
> >
> > Fragment 0:0
> >
> >
> > [Error Id: b2514414-af51-46c5-938d-ff792e6a8a8c on
> centos-04.qa.lab:31010]
> > (state=,code=0)
> >
> > 0: jdbc:drill:schema=dfs.tmp> select * from `rows.json`;
> >
> > Error: DATA_READ ERROR: Error parsing JSON - You tried to start when you
> > are using a ValueWriter of type NullableVarCharWriterImpl.
> >
> >
> > File  /tmp/rows.json
> >
> > Record  1
> >
> > Fragment 0:0
> >
> >
> > [Error Id: 3615e088-64cc-4cba-8cfe-602563f4e16c on
> centos-04.qa.lab:31010]
> > (state=,code=0)
> >
> > These are the occorrences of field named *smallest* in the JSON file
> >
> > Administrators-MacBook-Pro:Downloads kfaraaz$ grep "smallest" rows.json
> >
> >           "smallest" : "44247",
> >
> >           "smallest" : "'C' A CATERING AND EVENT COMPANY",
> >
> >           "smallest" : "'C' A CATERING AND EVENT COMPANY",
> >
> >           "smallest" : "0",
> >
> >           "smallest" : "(convenience store)",
> >
> >           "smallest" : "All",
> >
> >           "smallest" : " ",
> >
> >           "smallest" : "alsip",
> >
> >           "smallest" : "IL",
> >
> >           "smallest" : "60007",
> >
> >           "smallest" : "2010-01-04T00:00:00",
> >
> >           "smallest" : "1315 license reinspection",
> >
> >           "smallest" : "Business Not Located",
> >
> >           "smallest" : "1. SOURCE SOUND CONDITION, NO SPOILAGE, FOODS
> > PROPERLY LABELED, SHELLFISH TAGS IN PLACE - Comments: ",
> >
> >           "smallest" : "41.64467013219805",
> >
> >           "smallest" : "-87.91442843927047",
> >
> >           "smallest" : {
> >
> > , [ 101045, "180563EF-F00C-41F9-A0C2-7B2100AD2DED", 101045, 1441883778,
> > "386464", 1441883778, "386464", null, "343208", "SUBWAY # 26817-309 W.
> > MONROE", "SUBWAY", "1519940", "Restaurant", "Risk 1 (High)", "309 W
> MONROE
> > ST ", "CHICAGO", "IL", "60606", "2010-07-13T00:00:00", "Canvass", "Pass",
> > "32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED
> AND
> > MAINTAINED - Comments: All food and non-food contact equipment and
> utensils
> > shall be smooth, easily cleanable, and durable, and shall be in good
> > repair. A few of the smallest cutting boards, with deep/dark grooves must
> > be sanded/bleached- to make smooth/easily cleanable. | 33. FOOD AND
> > NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS -
> > Comments: All food and non-food contact surfaces of equipment and all
> food
> > storage utensils shall be thoroughly cleaned and sanitized daily.
> Interior
> > of ice chute at soda dispensers need cleaning. | 34. FLOORS: CONSTRUCTED
> > PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING
> > METHODS USED - Comments: The floors shall be constructed per code, be
> > smooth and easily cleaned, and be kept clean and in good repair. Floors
> of
> > walk-in cooler (around corners) need cleaning; baseboard at the outside
> > base of the walk-in cooler door is in poor repair, must fix.",
> > "41.88052826494154", "-87.63556938091433", [ null, "41.88052826494154",
> > "-87.63556938091433", null, false ] ]
> >
> > Thanks,
> > Khurram
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message