From: GitBox
To: commits@hudi.apache.org
Reply-To: dev@hudi.apache.org
Subject: [GitHub] [incubator-hudi] adamjoneill commented on issue #1325: presto - querying nested object in parquet file created by hudi
Message-ID: <158153074414.17625.1268179671461777267.gitbox@gitbox.apache.org>
Date: Wed, 12 Feb 2020 18:05:44 -0000

adamjoneill commented on issue #1325: presto - querying nested object in parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585338502

I've managed to narrow down the issue to the data that is coming off the Kinesis stream.
I replaced the data from the stream with some test data. Starting from the following code:

```
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// `rdd`, `spark`, and `path` come from the surrounding streaming job.
if (!rdd.isEmpty()) {
  val json = rdd.map(record => new String(record))
  val dataFrame = spark.read.json(json)
  dataFrame.printSchema()
  dataFrame.show()

  val hudiTableName = "order"
  val hudiTablePath = path + hudiTableName
  val hudiOptions = Map[String, String](
    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
    HoodieWriteConfig.TABLE_NAME -> hudiTableName,
    DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "id")

  // Write data into the Hudi dataset
  dataFrame.write
    .format("org.apache.hudi")
    .options(hudiOptions)
    .mode(SaveMode.Overwrite)
    .save(hudiTablePath)
}
```

I replaced

```
val dataFrame = spark.read.json(json)
```

with

```
val dataFrame = sparkContext.parallelize(Seq(Foo(1, Bar(1, "first")), Foo(2, Bar(2, "second")))).toDF()
```

and `select * from table` worked, as did the nested query `select id, bar.id, bar.name from table`.

So at this stage it looks like the issue is with the data and how it comes off the Kinesis stream.
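One way to dig further might be to compare the schema Spark infers from JSON with the schema of the typed test data. The sketch below is only an illustration under assumptions: the `Foo` and `Bar` field names are guessed from the test data and the nested query above, the JSON strings are hypothetical stand-ins for the real Kinesis records, and `SchemaCompare` is a made-up name. It is not known whether a schema difference is the actual cause.

```scala
import org.apache.spark.sql.SparkSession

// Shapes guessed from the test data and the nested query; field names are assumptions.
case class Bar(id: Int, name: String)
case class Foo(id: Int, bar: Bar)

object SchemaCompare {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-compare")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Typed test data, built the same way as the replacement DataFrame.
    val typedDf = Seq(Foo(1, Bar(1, "first")), Foo(2, Bar(2, "second"))).toDF()

    // Hypothetical JSON payloads standing in for records read off the stream.
    val jsonDs = Seq(
      """{"id": 1, "bar": {"id": 1, "name": "first"}}""",
      """{"id": 2, "bar": {"id": 2, "name": "second"}}"""
    ).toDS()
    val jsonDf = spark.read.json(jsonDs)

    // Print both schemas side by side for a manual comparison.
    typedDf.printSchema()
    jsonDf.printSchema()

    // Report whether the two schemas are identical (types, order, nullability).
    println(s"schemas equal: ${typedDf.schema == jsonDf.schema}")

    spark.stop()
  }
}
```

Spark's JSON schema inference reads integer values as `long`, while `Int` fields in a case class become `integer`, so some difference between the two printed schemas is expected. Running the same comparison against a sample of the real stream records would show whether anything else differs.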