hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] adamjoneill commented on issue #1325: presto - querying nested object in parquet file created by hudi
Date Thu, 13 Feb 2020 19:31:36 GMT
adamjoneill commented on issue #1325: presto - querying nested object in parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585932919
 
 
   @vinothchandar my investigation above suggests that the problem lies in how Hudi writes Parquet data.
   
   Although limited in scope and involving many moving parts, my investigation consisted of the following steps:
   
   1. Take a record that includes an array of complex objects (the array item object has no primitive or "simple" fields) off the Kinesis stream.
   2. Save it to Parquet in S3 using the Spark DataFrame API.
   3. Save the same record to S3 using Hudi.
   4. Let AWS Glue crawl these files and create the database and tables.
   5. The Presto query `select * from table` against the Hudi Parquet file fails.
   6. The Presto query `select * from table` against the file written with the Spark DataFrame API succeeds.
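
For anyone trying to reproduce this, steps 2 and 3 might look roughly like the PySpark sketch below. Everything specific in it is an assumption for illustration, not the actual data or job: the record shape, the field names (`id`, `ts`, `partition`, `items`, `detail`), the table name, and the output paths are hypothetical. The Hudi option keys are the standard datasource write options; `write_both` needs a Spark session with the Hudi bundle on the classpath.

```python
import json

# Hypothetical record: each array item holds only a nested object,
# with no primitive fields directly on the item (as in step 1).
SAMPLE_RECORD = {
    "id": "1",
    "ts": 1,
    "partition": "p0",
    "items": [
        {"detail": {"name": "a"}},
        {"detail": {"name": "b"}},
    ],
}


def hudi_write_options(table_name, record_key="id", precombine="ts",
                       partition_field="partition"):
    """Build a minimal set of Hudi write options (field names are assumptions)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


def write_both(spark, plain_path, hudi_path, table_name="repro_table"):
    """Write the same one-row DataFrame as plain Parquet and through Hudi."""
    rdd = spark.sparkContext.parallelize([json.dumps(SAMPLE_RECORD)])
    df = spark.read.json(rdd)

    # Step 2: plain Parquet through the DataFrame API.
    df.write.mode("overwrite").parquet(plain_path)

    # Step 3: the same record written through the Hudi datasource.
    (df.write.format("org.apache.hudi")
        .options(**hudi_write_options(table_name))
        .mode("overwrite")
        .save(hudi_path))
    return df
```

Pointing the Glue crawler at both output paths and running the same Presto query against each resulting table would then isolate the write path as the only difference between the two files.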
   
   I agree it seems strange, and the stack trace does point to a Presto issue with reading the array. Unfortunately, I'm not familiar enough with the project to know where to begin debugging. What can I do to investigate further?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
