hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] adamjoneill opened a new issue #1325: presto - querying nested object in parquet file created by hudi
Date Tue, 11 Feb 2020 21:28:07 GMT
adamjoneill opened a new issue #1325: presto - querying nested object in parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325
 
 
   **Describe the problem you faced**
   
   Using an AWS EMR Spark job to create a Hudi Parquet record in S3 from a Kinesis stream.
   Querying this record from Presto works fine, but I can't seem to query a nested column.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Run a Spark job that reads from the Kinesis stream and saves the Hudi file to S3
   2. An AWS Glue job creates a database from the record
   3. Log into the AWS EMR cluster with Presto installed
   4. run `presto-cli --catalog hive --schema schema --server server:8889`
   5. queries:
   
   Works without nesting:
   ```
   presto:schema> select id from default;
       id    
   ----------
    34551832 
   (1 row)
   
   Query 20200211_212022_00055_hej8h, FINISHED, 1 node
   Splits: 17 total, 17 done (100.00%)
   0:01 [1 rows, 93B] [1 rows/s, 179B/s]
   ```
   Query that doesn't work with nesting:
   ```
   presto:schema> select id, order.channel from default;
   Query 20200211_212107_00056_hej8h failed: line 1:12: mismatched input 'order'. Expecting: '*', <expression>, <identifier>
   select id, order.channel from default
   ```
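
   A note on a possible cause (an assumption on my part, not confirmed above): `order` is a reserved keyword in Presto SQL, which would explain the parser error rather than any problem with the nested data itself. Quoting the identifier may allow the query to parse, with ordinary dot notation used for the nested row field:

   ```
   -- "order" is a SQL reserved word; quoting it lets Presto treat it as a
   -- plain column identifier, after which nested fields are addressable
   -- with dot notation as usual.
   select id, "order".channel from default;
   ```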
   
   Table structure:
   
   ```
    presto:data-lake-database-dev-adam-8> show columns from default;
             Column         |  Type
    ------------------------+------------------------------------------------------------------
     _hoodie_commit_time    | varchar
     _hoodie_commit_seqno   | varchar
     _hoodie_record_key     | varchar
     _hoodie_partition_path | varchar
     _hoodie_file_name      | varchar
     eventtimestamp         | varchar
     id                     | bigint
     order                  | row(channel varchar, customer row(address row(country varchar, postcode varchar, region varchar), birthdate varchar, createddate varchar, email varchar, firstname varchar, id bigi
    (11 rows)
   ```
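   
   As a sanity check (my own suggestion, not part of the steps above), the nested struct can be read directly from a Parquet file with pyarrow, which helps confirm the file itself is intact and isolate the problem to the Presto side. The schema below is a hypothetical stand-in mirroring the `show columns` output:
   
   ```python
   # Sanity check (hypothetical sample data): write and read back a Parquet
   # file with a nested "order" struct, like the one Hudi produced, to show
   # the nested field is addressable outside Presto.
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   table = pa.table({
       "id": pa.array([34551832], type=pa.int64()),
       "order": pa.array(
           [{"channel": "web"}],
           type=pa.struct([("channel", pa.string())]),
       ),
   })
   pq.write_table(table, "/tmp/sample.parquet")
   
   # Reading it back, the nested field needs no quoting or special syntax.
   loaded = pq.read_table("/tmp/sample.parquet")
   channels = loaded.column("order").combine_chunks().field("channel")
   print(channels.to_pylist())  # -> ['web']
   ```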
   
   **Expected behavior**
   
   Nested row object to be output in the query result.
   
   **Environment Description**
   
   * Hudi version : hudi-spark-bundle:0.5.0-incubating, (with org.apache.spark:spark-avro_2.11:2.4.4)
   
   * Spark version : 2.4.4
   
   * Hive version : 2.3.6
   
   * Pig version : 0.17.0
   
   * Presto version : 0.227
   
   * Hadoop version : Amazon 2.8.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
