Subject: [GitHub] [incubator-hudi] bhasudha commented on issue #1325: presto - querying nested object in parquet file created by hudi
Date: Sat, 29 Feb 2020 21:39:05 -0000

bhasudha commented on issue #1325: presto - querying nested object in parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-592997350

@adamjoneill I tried to reproduce this in local docker (with a small modification to the Hudi docker demo setup) and could not. I am able to `select *` on both schemas (with and without the simple identifier in the array element). Here are the setup and reproduction steps. I am using the Hudi docker demo setup on the latest master.

- Hudi version: 0.5.2-incubating
- Spark version: 2.4.4
- Hive version: 2.3.3, Presto 0.217
- Hadoop version: 2.8.4
- Storage (HDFS/S3/GCS..): HDFS
- Running on Docker? (yes/no): yes

I tried both with and without the simple identifier inside the array item. Here is the setup I used.

`nested-table-kafka-source.properties`:

```properties
include=base.properties
# Key fields, for kafka example
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=date
# Schema provider props (change to absolute path based on your installation)
hoodie.deltastreamer.schemaprovider.source.schema.file=/var/demo/config/nested_table_schema.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=/var/demo/config/nested_table_schema.avsc
# Kafka Source
hoodie.deltastreamer.source.kafka.topic=nested_table
# Kafka props
bootstrap.servers=kafkabroker:9092
auto.offset.reset=earliest
```

`nested_table_schema.avsc` (for the variant without the simple identifier, remove the first `id` field inside `foos_record`):

```json
{
  "name": "nested_table",
  "type": "record",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "date", "type": "string" },
    {
      "name": "foos",
      "type": {
        "type": "array",
        "items": {
          "name": "foos_record",
          "type": "record",
          "fields": [
            { "name": "id", "type": "int" },
            {
              "name": "bar",
              "type": {
                "name": "bar",
                "type": "record",
                "fields": [
                  { "name": "id", "type": "int" },
                  { "name": "name", "type": "string" }
                ]
              }
            }
          ]
        }
      }
    }
  ]
}
```
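To make the two schema variants concrete, here is a small illustrative Python sketch (plain stdlib, not part of the demo setup) that takes one of the sample records below and derives the "without simple identifier" shape by dropping the `id` field from each `foos` element:

```python
import json

# One of the sample records from nested_table.json
record = json.loads(
    '{"id":1,"date":"2020/02/15","foos":['
    '{"id":11,"bar":{"id":1,"name":"OneBar"}},'
    '{"id":12,"bar":{"id":2,"name":"TwoBar"}},'
    '{"id":13,"bar":{"id":3,"name":"ThreeBar"}}]}'
)

def drop_simple_id(rec):
    """Return a copy of the record without the simple `id` inside each foos element."""
    out = dict(rec)
    # keep only the nested `bar` record in each array element
    out["foos"] = [{"bar": foo["bar"]} for foo in rec["foos"]]
    return out

without_simple = drop_simple_id(record)
print(json.dumps(without_simple))
```

Records of the first shape produce the Presto type `array(row(id integer, bar row(id integer, name varchar)))`; records of the second shape produce `array(row(bar row(id integer, name varchar)))`, matching the `show columns` output further down.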
`nested_table.json` (remove the `id` entry inside each `foos` element for the variant without the simple identifier):

```json
{"id":1,"date": "2020/02/15","foos":[{"id":11,"bar":{"id":1,"name":"OneBar"}},{"id":12,"bar":{"id":2,"name":"TwoBar"}},{"id":13,"bar":{"id":3,"name":"ThreeBar"}}]}
{"id":2,"date": "2020/02/16","foos":[{"id":21,"bar":{"id":2,"name":"OneBar"}},{"id":22,"bar":{"id":2,"name":"TwoBar"}},{"id":23,"bar":{"id":3,"name":"ThreeBar"}}]}
{"id":3,"date": "2020/02/17","foos":[{"id":31,"bar":{"id":3,"name":"OneBar"}},{"id":32,"bar":{"id":2,"name":"TwoBar"}},{"id":33,"bar":{"id":3,"name":"ThreeBar"}}]}
{"id":4,"date": "2020/02/18","foos":[{"id":41,"bar":{"id":4,"name":"OneBar"}},{"id":42,"bar":{"id":2,"name":"TwoBar"}},{"id":43,"bar":{"id":3,"name":"ThreeBar"}}]}
```

The steps to publish to Kafka, ingest from Kafka, and sync with Hive are on the website (https://hudi.apache.org/docs/docker_demo.html); I modified them a bit for this example:

```shell
# publish data from local to kafka; run from the hudi repo dir
cat docker/demo/data/nested_table.json | kafkacat -b kafkabroker -t nested_table -P

# verify the kafka publish
kafkacat -b kafkabroker -L -J | jq .

# now hop onto one of the adhoc containers
docker exec -it adhoc-2 /bin/bash

# ingest data into hdfs using deltastreamer
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field id \
  --target-base-path /user/hive/warehouse/nested_table_cow \
  --target-table nested_table_cow \
  --props /var/demo/config/nested-table-kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
```

Verify the ingest by viewing this page in WebHDFS: http://namenode:50070/explorer.html#/user/hive/warehouse/nested_table_cow
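The properties file drives where each record lands: `hoodie.datasource.write.recordkey.field=id` and `hoodie.datasource.write.partitionpath.field=date`. As a hedged sketch (this is an illustration of the config's effect, not Hudi's actual key-generator code), the mapping from record to key and partition directory looks like:

```python
# Illustrative only: mimic the effect of
#   hoodie.datasource.write.recordkey.field=id
#   hoodie.datasource.write.partitionpath.field=date
# on the sample records. `hoodie_key` is a hypothetical helper name.
RECORDKEY_FIELD = "id"
PARTITIONPATH_FIELD = "date"
BASE_PATH = "/user/hive/warehouse/nested_table_cow"

def hoodie_key(record):
    # record key identifies the row; partition path picks the HDFS subdirectory
    return str(record[RECORDKEY_FIELD]), record[PARTITIONPATH_FIELD]

for rec in [{"id": 1, "date": "2020/02/15"}, {"id": 4, "date": "2020/02/18"}]:
    key, partition = hoodie_key(rec)
    print(f"key={key} -> {BASE_PATH}/{partition}")
```

This matches the Presto output below, where `_hoodie_record_key` is the `id` and `_hoodie_partition_path` is the `date` value.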
Sync to Hive to make it available as a Hive table:

```shell
/var/hoodie/ws/hudi-hive/run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:10000 \
  --user hive --pass hive --partitioned-by dt \
  --base-path /user/hive/warehouse/nested_table_cow \
  --database default --table nested_table_cow
```

Now we can query from Presto:

```shell
docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
```

Presto query results with the simple `id` inside the array element (wide rows are truncated by the terminal):

```
presto:default> use hive.default;
presto:default> show columns from nested_table_cow;
         Column         |                           Type                            |     Extra     | Comment
------------------------+-----------------------------------------------------------+---------------+---------
 _hoodie_commit_time    | varchar                                                   |               |
 _hoodie_commit_seqno   | varchar                                                   |               |
 _hoodie_record_key     | varchar                                                   |               |
 _hoodie_partition_path | varchar                                                   |               |
 _hoodie_file_name      | varchar                                                   |               |
 id                     | integer                                                   |               |
 date                   | varchar                                                   |               |
 foos                   | array(row(id integer, bar row(id integer, name varchar))) |               |
 dt                     | varchar                                                   | partition key |
(9 rows)

Query 20200229_203415_00003_ay9hw, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
0:01 [9 rows, 792B] [12 rows/s, 1.11KB/s]

presto:default> select * from nested_table_cow;
 _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                                     | id | date       | foos
---------------------+----------------------+--------------------+------------------------+-----------------------------------------------------------------------+----+------------+-------------------------------------------------------
 20200229201846      | 20200229201846_0_3   | 3                  | 2020/02/17             | 08524ae2-9874-4f52-8b8e-97a509696ee6-0_0-21-24_20200229201846.parquet | 3  | 2020/02/17 | [{id=31, bar={id=3, name=OneBar}}, {id=32, bar={id=2,
 20200229201846      | 20200229201846_3_4   | 2                  | 2020/02/16             | 3ed41232-08a9-41ff-b97c-7b964c8d7978-0_3-21-27_20200229201846.parquet | 2  | 2020/02/16 | [{id=21, bar={id=2, name=OneBar}}, {id=22, bar={id=2,
 20200229201846      | 20200229201846_1_1   | 4                  | 2020/02/18             | 85fa663a-c86f-43f7-94a5-a813a0a67314-0_1-21-25_20200229201846.parquet | 4  | 2020/02/18 | [{id=41, bar={id=4, name=OneBar}}, {id=42, bar={id=2,
 20200229201846      | 20200229201846_2_2   | 1                  | 2020/02/15             | 916d9d23-da62-4d6b-bb9d-85889e8c6ab8-0_2-21-26_20200229201846.parquet | 1  | 2020/02/15 | [{id=11, bar={id=1, name=OneBar}}, {id=12, bar={id=2,
(4 rows)

Query 20200229_204101_00005_ay9hw, FINISHED, 1 node
Splits: 20 total, 20 done (100.00%)
0:05 [4 rows, 5KB] [0 rows/s, 1.02KB/s]
```

Presto query results after removing the simple identifier:

```
presto:default> show columns from nested_table_cow;
         Column         |                     Type                      |     Extra     | Comment
------------------------+-----------------------------------------------+---------------+---------
 _hoodie_commit_time    | varchar                                       |               |
 _hoodie_commit_seqno   | varchar                                       |               |
 _hoodie_record_key     | varchar                                       |               |
 _hoodie_partition_path | varchar                                       |               |
 _hoodie_file_name      | varchar                                       |               |
 id                     | integer                                       |               |
 date                   | varchar                                       |               |
 foos                   | array(row(bar row(id integer, name varchar))) |               |
 dt                     | varchar                                       | partition key |
(9 rows)

Query 20200229_205313_00002_xjxz7, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
0:02 [9 rows, 780B] [4 rows/s, 387B/s]

presto:default> select * from nested_table_cow;
 _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                                     | id | date       | foos
---------------------+----------------------+--------------------+------------------------+-----------------------------------------------------------------------+----+------------+-------------------------------------------------------
 20200229204957      | 20200229204957_2_1   | 1                  | 2020/02/15             | fb63e6c8-bd12-442b-bdc3-26748f33a4a8-0_2-21-26_20200229204957.parquet | 1  | 2020/02/15 | [{bar={id=1, name=OneBar}}, {bar={id=2, name=TwoBar}},
 20200229204957      | 20200229204957_0_2   | 3                  | 2020/02/17             | 3722865d-7559-4b1c-94e2-6bf993c4d4a0-0_0-21-24_20200229204957.parquet | 3  | 2020/02/17 | [{bar={id=3, name=OneBar}}, {bar={id=2, name=TwoBar}},
 20200229204957      | 20200229204957_1_3   | 4                  | 2020/02/18             | 6e70cd2b-d922-425d-b94b-1f35691758b2-0_1-21-25_20200229204957.parquet | 4  | 2020/02/18 | [{bar={id=4, name=OneBar}}, {bar={id=2, name=TwoBar}},
 20200229204957      | 20200229204957_3_4   | 2                  | 2020/02/16             | f2a71bf6-ee01-4310-a8f3-2c4b1b1ab34e-0_3-21-27_20200229204957.parquet | 2  | 2020/02/16 | [{bar={id=2, name=OneBar}}, {bar={id=2, name=TwoBar}},
(4 rows)

Query 20200229_205341_00003_xjxz7, FINISHED, 1 node
Splits: 20 total, 20 done (100.00%)
0:23 [4 rows, 4.68KB] [0 rows/s, 212B/s]
```
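Beyond `select *`, nested fields in the `foos` column are typically reached by unnesting the array; in Presto that would be something like `SELECT id, f.bar.name FROM nested_table_cow CROSS JOIN UNNEST(foos) AS t(f)` (hedged: adapt aliases and field paths to your schema variant). The Python sketch below mimics that flattening on the sample records, just to show the expected row shape:

```python
# Illustrative sketch: flatten the nested `foos` array the way a Presto
# CROSS JOIN UNNEST would, using records shaped like this issue's sample data.
records = [
    {"id": 1, "date": "2020/02/15",
     "foos": [{"id": 11, "bar": {"id": 1, "name": "OneBar"}},
              {"id": 12, "bar": {"id": 2, "name": "TwoBar"}}]},
]

def unnest_foos(recs):
    # one output row per (record, foos element) pair,
    # projecting (record id, bar.id, bar.name)
    rows = []
    for rec in recs:
        for foo in rec["foos"]:
            rows.append((rec["id"], foo["bar"]["id"], foo["bar"]["name"]))
    return rows

for row in unnest_foos(records):
    print(row)
```

The same projection works for both schema variants, since it only touches the nested `bar` record.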