Subject: [GitHub] [incubator-hudi] bhasudha commented on issue #1325: presto - querying nested object in parquet file created by hudi
Date: Sat, 29 Feb 2020 21:39:05 -0000

bhasudha commented on issue #1325: presto - querying nested object in parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-592997350

@adamjoneill I tried to reproduce this in local docker (with a small modification to the Hudi docker demo setup) and could not. I am able to `select *` on both schemas (with and without the simple identifier in the array element). Here are the setup and reproduction steps. I am using the Hudi docker demo setup on the latest master.

- Hudi version: 0.5.2-incubating
- Spark version: 2.4.4
- Hive version: 2.3.3, Presto 0.217
- Hadoop version: 2.8.4
- Storage (HDFS/S3/GCS..): HDFS
- Running on Docker? (yes/no): yes

I tried both with and without the simple identifier inside the array item. Here is the setup I used.

`nested-table-kafka-source.properties`:

```properties
include=base.properties
# Key fields, for kafka example
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=date
# Schema provider props (change to absolute path based on your installation)
hoodie.deltastreamer.schemaprovider.source.schema.file=/var/demo/config/nested_table_schema.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=/var/demo/config/nested_table_schema.avsc
# Kafka Source
hoodie.deltastreamer.source.kafka.topic=nested_table
# Kafka props
bootstrap.servers=kafkabroker:9092
auto.offset.reset=earliest
```

`nested_table_schema.avsc` (for the variant without the simple identifier, remove the first `id` field inside `foos_record`):

```json
{
  "name": "nested_table",
  "type": "record",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "date", "type": "string" },
    {
      "name": "foos",
      "type": {
        "type": "array",
        "items": {
          "name": "foos_record",
          "type": "record",
          "fields": [
            { "name": "id", "type": "int" },
            {
              "name": "bar",
              "type": {
                "name": "bar",
                "type": "record",
                "fields": [
                  { "name": "id", "type": "int" },
                  { "name": "name", "type": "string" }
                ]
              }
            }
          ]
        }
      }
    }
  ]
}
```
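To make the two schema variants concrete, here is a small illustrative Python sketch (plain stdlib, not part of the demo setup) that takes one of the sample records below and derives the "without simple identifier" shape by dropping the `id` field from each `foos` element:

```python
import json

# One of the sample records from nested_table.json
record = json.loads(
    '{"id":1,"date":"2020/02/15","foos":['
    '{"id":11,"bar":{"id":1,"name":"OneBar"}},'
    '{"id":12,"bar":{"id":2,"name":"TwoBar"}},'
    '{"id":13,"bar":{"id":3,"name":"ThreeBar"}}]}'
)

def drop_simple_id(rec):
    """Return a copy of the record without the simple `id` inside each foos element."""
    out = dict(rec)
    # keep only the nested `bar` record in each array element
    out["foos"] = [{"bar": foo["bar"]} for foo in rec["foos"]]
    return out

without_simple = drop_simple_id(record)
print(json.dumps(without_simple))
```

Records of the first shape produce the Presto type `array(row(id integer, bar row(id integer, name varchar)))`; records of the second shape produce `array(row(bar row(id integer, name varchar)))`, matching the `show columns` output further down.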
`nested_table.json` (remove the `id` entry inside each `foos` element for the variant without the simple identifier):

```json
{"id":1,"date": "2020/02/15","foos":[{"id":11,"bar":{"id":1,"name":"OneBar"}},{"id":12,"bar":{"id":2,"name":"TwoBar"}},{"id":13,"bar":{"id":3,"name":"ThreeBar"}}]}
{"id":2,"date": "2020/02/16","foos":[{"id":21,"bar":{"id":2,"name":"OneBar"}},{"id":22,"bar":{"id":2,"name":"TwoBar"}},{"id":23,"bar":{"id":3,"name":"ThreeBar"}}]}
{"id":3,"date": "2020/02/17","foos":[{"id":31,"bar":{"id":3,"name":"OneBar"}},{"id":32,"bar":{"id":2,"name":"TwoBar"}},{"id":33,"bar":{"id":3,"name":"ThreeBar"}}]}
{"id":4,"date": "2020/02/18","foos":[{"id":41,"bar":{"id":4,"name":"OneBar"}},{"id":42,"bar":{"id":2,"name":"TwoBar"}},{"id":43,"bar":{"id":3,"name":"ThreeBar"}}]}
```

The steps to publish to Kafka, ingest from Kafka, and sync with Hive are on the website (https://hudi.apache.org/docs/docker_demo.html); I modified them a bit for this example:

```shell
# publish data from local to kafka; run from the hudi repo dir
cat docker/demo/data/nested_table.json | kafkacat -b kafkabroker -t nested_table -P

# verify the kafka publish
kafkacat -b kafkabroker -L -J | jq .

# now hop onto one of the adhoc containers
docker exec -it adhoc-2 /bin/bash

# ingest data into hdfs using deltastreamer
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field id \
  --target-base-path /user/hive/warehouse/nested_table_cow \
  --target-table nested_table_cow \
  --props /var/demo/config/nested-table-kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
```

Verify the ingest by viewing this page in WebHDFS: http://namenode:50070/explorer.html#/user/hive/warehouse/nested_table_cow
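The properties file drives where each record lands: `hoodie.datasource.write.recordkey.field=id` and `hoodie.datasource.write.partitionpath.field=date`. As a hedged sketch (this is an illustration of the config's effect, not Hudi's actual key-generator code), the mapping from record to key and partition directory looks like:

```python
# Illustrative only: mimic the effect of
#   hoodie.datasource.write.recordkey.field=id
#   hoodie.datasource.write.partitionpath.field=date
# on the sample records. `hoodie_key` is a hypothetical helper name.
RECORDKEY_FIELD = "id"
PARTITIONPATH_FIELD = "date"
BASE_PATH = "/user/hive/warehouse/nested_table_cow"

def hoodie_key(record):
    # record key identifies the row; partition path picks the HDFS subdirectory
    return str(record[RECORDKEY_FIELD]), record[PARTITIONPATH_FIELD]

for rec in [{"id": 1, "date": "2020/02/15"}, {"id": 4, "date": "2020/02/18"}]:
    key, partition = hoodie_key(rec)
    print(f"key={key} -> {BASE_PATH}/{partition}")
```

This matches the Presto output below, where `_hoodie_record_key` is the `id` and `_hoodie_partition_path` is the `date` value.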
Sync to Hive to make it available as a Hive table:

```shell
/var/hoodie/ws/hudi-hive/run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:10000 \
  --user hive --pass hive --partitioned-by dt \
  --base-path /user/hive/warehouse/nested_table_cow \
  --database default --table nested_table_cow
```

Now we can query from Presto:

```shell
docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
```

Presto query results with the simple `id` inside the array element (wide rows are truncated by the terminal):

```
presto:default> use hive.default;
presto:default> show columns from nested_table_cow;
         Column         |                           Type                            |     Extra     | Comment
------------------------+-----------------------------------------------------------+---------------+---------
 _hoodie_commit_time    | varchar                                                   |               |
 _hoodie_commit_seqno   | varchar                                                   |               |
 _hoodie_record_key     | varchar                                                   |               |
 _hoodie_partition_path | varchar                                                   |               |
 _hoodie_file_name      | varchar                                                   |               |
 id                     | integer                                                   |               |
 date                   | varchar                                                   |               |
 foos                   | array(row(id integer, bar row(id integer, name varchar))) |               |
 dt                     | varchar                                                   | partition key |
(9 rows)

Query 20200229_203415_00003_ay9hw, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
0:01 [9 rows, 792B] [12 rows/s, 1.11KB/s]

presto:default> select * from nested_table_cow;
 _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                                     | id | date       | foos
---------------------+----------------------+--------------------+------------------------+-----------------------------------------------------------------------+----+------------+-------------------------------------------------------
 20200229201846      | 20200229201846_0_3   | 3                  | 2020/02/17             | 08524ae2-9874-4f52-8b8e-97a509696ee6-0_0-21-24_20200229201846.parquet | 3  | 2020/02/17 | [{id=31, bar={id=3, name=OneBar}}, {id=32, bar={id=2,
 20200229201846      | 20200229201846_3_4   | 2                  | 2020/02/16             | 3ed41232-08a9-41ff-b97c-7b964c8d7978-0_3-21-27_20200229201846.parquet | 2  | 2020/02/16 | [{id=21, bar={id=2, name=OneBar}}, {id=22, bar={id=2,
 20200229201846      | 20200229201846_1_1   | 4                  | 2020/02/18             | 85fa663a-c86f-43f7-94a5-a813a0a67314-0_1-21-25_20200229201846.parquet | 4  | 2020/02/18 | [{id=41, bar={id=4, name=OneBar}}, {id=42, bar={id=2,
 20200229201846      | 20200229201846_2_2   | 1                  | 2020/02/15             | 916d9d23-da62-4d6b-bb9d-85889e8c6ab8-0_2-21-26_20200229201846.parquet | 1  | 2020/02/15 | [{id=11, bar={id=1, name=OneBar}}, {id=12, bar={id=2,
(4 rows)

Query 20200229_204101_00005_ay9hw, FINISHED, 1 node
Splits: 20 total, 20 done (100.00%)
0:05 [4 rows, 5KB] [0 rows/s, 1.02KB/s]
```

Presto query results after removing the simple identifier:

```
presto:default> show columns from nested_table_cow;
         Column         |                     Type                      |     Extra     | Comment
------------------------+-----------------------------------------------+---------------+---------
 _hoodie_commit_time    | varchar                                       |               |
 _hoodie_commit_seqno   | varchar                                       |               |
 _hoodie_record_key     | varchar                                       |               |
 _hoodie_partition_path | varchar                                       |               |
 _hoodie_file_name      | varchar                                       |               |
 id                     | integer                                       |               |
 date                   | varchar                                       |               |
 foos                   | array(row(bar row(id integer, name varchar))) |               |
 dt                     | varchar                                       | partition key |
(9 rows)

Query 20200229_205313_00002_xjxz7, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
0:02 [9 rows, 780B] [4 rows/s, 387B/s]

presto:default> select * from nested_table_cow;
 _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                                     | id | date       | foos
---------------------+----------------------+--------------------+------------------------+-----------------------------------------------------------------------+----+------------+-------------------------------------------------------
 20200229204957      | 20200229204957_2_1   | 1                  | 2020/02/15             | fb63e6c8-bd12-442b-bdc3-26748f33a4a8-0_2-21-26_20200229204957.parquet | 1  | 2020/02/15 | [{bar={id=1, name=OneBar}}, {bar={id=2, name=TwoBar}},
 20200229204957      | 20200229204957_0_2   | 3                  | 2020/02/17             | 3722865d-7559-4b1c-94e2-6bf993c4d4a0-0_0-21-24_20200229204957.parquet | 3  | 2020/02/17 | [{bar={id=3, name=OneBar}}, {bar={id=2, name=TwoBar}},
 20200229204957      | 20200229204957_1_3   | 4                  | 2020/02/18             | 6e70cd2b-d922-425d-b94b-1f35691758b2-0_1-21-25_20200229204957.parquet | 4  | 2020/02/18 | [{bar={id=4, name=OneBar}}, {bar={id=2, name=TwoBar}},
 20200229204957      | 20200229204957_3_4   | 2                  | 2020/02/16             | f2a71bf6-ee01-4310-a8f3-2c4b1b1ab34e-0_3-21-27_20200229204957.parquet | 2  | 2020/02/16 | [{bar={id=2, name=OneBar}}, {bar={id=2, name=TwoBar}},
(4 rows)

Query 20200229_205341_00003_xjxz7, FINISHED, 1 node
Splits: 20 total, 20 done (100.00%)
0:23 [4 rows, 4.68KB] [0 rows/s, 212B/s]
```
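Beyond `select *`, nested fields in the `foos` column are typically reached by unnesting the array; in Presto that would be something like `SELECT id, f.bar.name FROM nested_table_cow CROSS JOIN UNNEST(foos) AS t(f)` (hedged: adapt aliases and field paths to your schema variant). The Python sketch below mimics that flattening on the sample records, just to show the expected row shape:

```python
# Illustrative sketch: flatten the nested `foos` array the way a Presto
# CROSS JOIN UNNEST would, using records shaped like this issue's sample data.
records = [
    {"id": 1, "date": "2020/02/15",
     "foos": [{"id": 11, "bar": {"id": 1, "name": "OneBar"}},
              {"id": 12, "bar": {"id": 2, "name": "TwoBar"}}]},
]

def unnest_foos(recs):
    # one output row per (record, foos element) pair,
    # projecting (record id, bar.id, bar.name)
    rows = []
    for rec in recs:
        for foo in rec["foos"]:
            rows.append((rec["id"], foo["bar"]["id"], foo["bar"]["name"]))
    return rows

for row in unnest_foos(records):
    print(row)
```

The same projection works for both schema variants, since it only touches the nested `bar` record.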