hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] popart opened a new issue #1329: [SUPPORT] Presto cannot query non-partitioned table
Date Wed, 12 Feb 2020 22:10:56 GMT
popart opened a new issue #1329: [SUPPORT] Presto cannot query non-partitioned table
URL: https://github.com/apache/incubator-hudi/issues/1329
 
 
   **Describe the problem you faced**
   
   I made a non-partitioned Hudi table using Spark. I was able to query it with Spark and Hive, but when I tried querying it with Presto, I received the error `Could not find partitionDepth in partition metafile`.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Launch an emr-5.28.0 cluster
   2. Run spark-shell:
   ```
   spark-shell \
     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --deploy-mode client
   ```
   3. Run the following Spark code:
   ```
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   import org.apache.hudi.hive._
   import org.apache.hudi.keygen.NonpartitionedKeyGenerator
   
   val inputPath = "s3://path/to/a/parquet/file"
   val tableName = "my_test_table"
   val basePath = "s3://test-bucket/my_test_table" 
   
   val inputDf = spark.read.parquet(inputPath)
   
   val hudiOptions = Map[String,String](
       RECORDKEY_FIELD_OPT_KEY -> "dim_advertiser_id",
       PRECOMBINE_FIELD_OPT_KEY -> "update_time",
       TABLE_NAME -> tableName,
       KEYGENERATOR_CLASS_OPT_KEY -> classOf[NonpartitionedKeyGenerator].getCanonicalName, // needed for non-partitioned table
       HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[NonPartitionedExtractor].getCanonicalName, // needed for non-partitioned table
       OPERATION_OPT_KEY -> BULK_INSERT_OPERATION_OPT_VAL,
       HIVE_SYNC_ENABLED_OPT_KEY -> "true",
       HIVE_TABLE_OPT_KEY -> tableName,
       TABLE_TYPE_OPT_KEY -> COW_TABLE_TYPE_OPT_VAL,
       "hoodie.bulkinsert.shuffle.parallelism" -> "10")
   
   inputDf.write.format("org.apache.hudi").
       options(hudiOptions).
       mode(Overwrite).
       save(basePath);
   ```
   4. Querying the table works in both Spark and Hive
   5. Querying the table in Presto fails (see the diagnostic sketch after the output below)
   ```
   [hadoop@ip-172-31-128-118 ~]$ presto-cli --catalog hive --schema default
   presto:default> select count(*) from my_test_table;
   
   Query 20200211_185123_00018_pruwt, FAILED, 1 node
   Splits: 17 total, 0 done (0.00%)
   0:02 [0 rows, 0B] [0 rows/s, 0B/s]
   
   Query 20200211_185123_00018_pruwt failed: Could not find partitionDepth in partition metafile
   com.facebook.presto.spi.PrestoException: Could not find partitionDepth in partition metafile
     at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:200)
     at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
     at com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
     at com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
     at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
    Caused by: org.apache.hudi.exception.HoodieException: Could not find partitionDepth in partition metafile
     at org.apache.hudi.common.model.HoodiePartitionMetadata.getPartitionDepth(HoodiePartitionMetadata.java:75)
     at org.apache.hudi.hadoop.HoodieParquetInputFormat.getTableMetaClient(HoodieParquetInputFormat.java:209)
     at org.apache.hudi.hadoop.HoodieParquetInputFormat.groupFileStatus(HoodieParquetInputFormat.java:158)
     at org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:69)
     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:288)
     at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:371)
     at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:264)
     at com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:96)
     at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:193)
     ... 7 more
   ```
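   
   To see what the bulk insert actually wrote, here is a minimal diagnostic sketch that can be run in the same spark-shell session. It assumes the partition metafile is named `.hoodie_partition_metadata` and, for a non-partitioned table, sits directly under the base path; both details are assumptions on my part, not something confirmed in this issue.
   
   ```scala
   // Diagnostic sketch (assumption: the metafile is ".hoodie_partition_metadata"
   // and lives directly under basePath for a non-partitioned table).
   import org.apache.hadoop.fs.Path
   
   val base = new Path(basePath)
   val fs = base.getFileSystem(spark.sparkContext.hadoopConfiguration)
   val metafile = new Path(base, ".hoodie_partition_metadata")
   
   if (fs.exists(metafile)) {
     // Print the metafile contents; it should be a small properties-style text file.
     val in = fs.open(metafile)
     try scala.io.Source.fromInputStream(in).getLines().foreach(println)
     finally in.close()
   } else {
     println(s"No partition metafile found at $metafile")
   }
   ```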
   
   **Expected behavior**
   
   Presto should return a count of all the rows. Other Presto queries should succeed.
   
   **Environment Description**
   
   * EMR version: emr-5.28.0
   
   * Hudi version : 0.5.1-incubating
   
   * Spark version : 2.4.4
   
   * Hive version : 2.3.6
   
   * Hadoop version : 2.8.5
   
   * Presto version: 0.227
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   **Stacktrace**
   
   Included above under "To Reproduce".
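   
   The `Caused by` frame is `HoodiePartitionMetadata.getPartitionDepth`, so the failure happens while reading the partition metafile rather than while locating data files. As a rough sketch of what that read amounts to (assuming the metafile is a Java properties file with a `partitionDepth` key; both details are inferred from the error text, not taken from the Hudi source):
   
   ```scala
   // Illustrative sketch only -- not the actual Hudi implementation.
   // Assumes the metafile is a Java properties file containing a
   // "partitionDepth" key; the error above is what results when the
   // key is absent or the file is unreadable.
   import java.util.Properties
   import org.apache.hadoop.fs.{FileSystem, Path}
   
   def readPartitionDepth(fs: FileSystem, partitionPath: Path): Int = {
     val props = new Properties()
     val in = fs.open(new Path(partitionPath, ".hoodie_partition_metadata"))
     try props.load(in) finally in.close()
     Option(props.getProperty("partitionDepth")).map(_.trim.toInt).getOrElse(
       throw new RuntimeException("Could not find partitionDepth in partition metafile"))
   }
   ```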
   
