hudi-commits mailing list archives

From "Udit Mehrotra (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HUDI-1839) FSUtils getAllPartitions broken by NotSerializableException: org.apache.hadoop.fs.Path
Date Tue, 22 Jun 2021 20:22:00 GMT

    [ https://issues.apache.org/jira/browse/HUDI-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367656#comment-17367656 ]

Udit Mehrotra commented on HUDI-1839:
-------------------------------------

[~satishkotha] Were you using the Kryo serializer while doing this testing? There are some other issues with these methods under the default Spark serializer: [https://github.com/apache/hudi/pull/2893#discussion_r654727713].
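
For reference, enabling Kryo in a Spark job looks roughly like the sketch below. This is illustrative only; the class name, app name, and master are placeholders, not taken from the reporter's job.

{code:java}
// Minimal sketch (placeholders, not the reporter's job): switching the job
// to the Kryo serializer, which is what the question above refers to.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class KryoConfigSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("hudi-kryo-sketch")   // placeholder app name
        .setMaster("local[2]")            // placeholder; normally set via spark-submit
        // The default is org.apache.spark.serializer.JavaSerializer, which requires
        // java.io.Serializable and therefore rejects org.apache.hadoop.fs.Path.
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

    JavaSparkContext jsc = new JavaSparkContext(conf);
    // ... schedule clustering / cleaning with this context ...
    jsc.stop();
  }
}
{code}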

> FSUtils getAllPartitions broken by NotSerializableException: org.apache.hadoop.fs.Path
> --------------------------------------------------------------------------------------
>
>                 Key: HUDI-1839
>                 URL: https://issues.apache.org/jira/browse/HUDI-1839
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: satish
>            Priority: Blocker
>             Fix For: 0.9.0
>
>
> FSUtils.getAllPartitionPaths is expected to work whether or not the metadata table is enabled, and it can also be called inside a Spark context. But it looks like, in trying to improve parallelism, we end up causing NotSerializableExceptions. There are multiple callers that use it within a Spark context (clustering/cleaner).
> See the stack trace below; a reduced sketch of the failing pattern follows the trace.
> 21/04/20 17:28:44 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.hudi.exception.HoodieException: Error fetching partition paths from metadata table
>  at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:321)
>  at org.apache.hudi.table.action.cluster.strategy.PartitionAwareClusteringPlanStrategy.generateClusteringPlan(PartitionAwareClusteringPlanStrategy.java:67)
>  at org.apache.hudi.table.action.cluster.SparkClusteringPlanActionExecutor.createClusteringPlan(SparkClusteringPlanActionExecutor.java:71)
>  at org.apache.hudi.table.action.cluster.BaseClusteringPlanActionExecutor.execute(BaseClusteringPlanActionExecutor.java:56)
>  at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.scheduleClustering(HoodieSparkCopyOnWriteTable.java:160)
>  at org.apache.hudi.client.AbstractHoodieWriteClient.scheduleClusteringAtInstant(AbstractHoodieWriteClient.java:873)
>  at org.apache.hudi.client.AbstractHoodieWriteClient.scheduleClustering(AbstractHoodieWriteClient.java:861)
>  at com.uber.data.efficiency.hudi.HudiRewriter.rewriteDataUsingHudi(HudiRewriter.java:111)
>  at com.uber.data.efficiency.hudi.HudiRewriter.main(HudiRewriter.java:50)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
>  Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Failed to serialize task 53, not attempting to retry it. Exception during serialization: java.io.NotSerializableException: org.apache.hadoop.fs.Path
>  Serialization stack:
>  - object not serializable (class: org.apache.hadoop.fs.Path, value: hdfs://...)
>  - element of array (index: 0)
>  - array (class [Ljava.lang.Object;, size 1)
>  - field (class: scala.collection.mutable.WrappedArray$ofRef, name: array, type: class [Ljava.lang.Object;)
>  - object (class scala.collection.mutable.WrappedArray$ofRef, WrappedArray(hdfs://...))
>  - writeObject data (class: org.apache.spark.rdd.ParallelCollectionPartition)
>  - object (class org.apache.spark.rdd.ParallelCollectionPartition, org.apache.spark.rdd.ParallelCollectionPartition@735)
>  - field (class: org.apache.spark.scheduler.ResultTask, name: partition, type: interface org.apache.spark.Partition)
>  - object (class org.apache.spark.scheduler.ResultTask, ResultTask(1, 0))
>  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1904)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1892)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1891)
>  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1891)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:935)
>  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:935)
>  at scala.Option.foreach(Option.scala:257)
>  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:935)
>  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2125)
>  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2074)
>  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2063)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:746)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2070)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2091)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2110)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2135)
>  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:968)
>  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:967)
>  at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:361)
>  at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
>  at org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:79)
>  at org.apache.hudi.metadata.FileSystemBackedTableMetadata.getAllPartitionPaths(FileSystemBackedTableMetadata.java:79)
>  at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:319)
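
The trace above reduces to Spark trying to Java-serialize org.apache.hadoop.fs.Path values that were parallelized on the driver (they end up inside the ParallelCollectionPartition being shipped to executors). The sketch below is a hypothetical reduction of that pattern plus a String-based workaround; PathSerializationSketch and listNames are made-up names and this is not the actual code in FileSystemBackedTableMetadata.

{code:java}
// Hypothetical reduction of the failure, not the actual Hudi code: shipping
// org.apache.hadoop.fs.Path through the default Java serializer fails, while
// shipping plain Strings and rebuilding the Path on the executor does not.
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaSparkContext;

public class PathSerializationSketch {

  static List<String> listNames(JavaSparkContext jsc, List<Path> dirs) {
    // This form fails at task serialization with
    // java.io.NotSerializableException: org.apache.hadoop.fs.Path, because the
    // Path objects become part of the partition data shipped to executors:
    //
    //   jsc.parallelize(dirs, dirs.size()).map(Path::getName).collect();

    // Workaround: parallelize Strings instead and construct the Path inside the
    // task, so no Path instance is ever serialized by the driver.
    List<String> dirStrings = dirs.stream().map(Path::toString).collect(Collectors.toList());
    return jsc.parallelize(dirStrings, dirStrings.size())
        .map(p -> new Path(p).getName())
        .collect();
  }
}
{code}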



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
