hudi-commits mailing list archives

From "lamber-ken (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HUDI-281) HiveSync failure through Spark when useJdbc is set to false
Date Wed, 01 Jan 2020 00:53:00 GMT

    [ https://issues.apache.org/jira/browse/HUDI-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006274#comment-17006274 ]

lamber-ken commented on HUDI-281:
---------------------------------

Hi [~uditme], I think this PR may solve this issue; please track [https://github.com/apache/incubator-hudi/pull/1125]

> HiveSync failure through Spark when useJdbc is set to false
> -----------------------------------------------------------
>
>                 Key: HUDI-281
>                 URL: https://issues.apache.org/jira/browse/HUDI-281
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Hive Integration, Spark Integration, Usability
>            Reporter: Udit Mehrotra
>            Priority: Major
>
> Table creation with Hive sync through Spark fails when I set *useJdbc* to *false*. Currently I have to modify the code to set *useJdbc* to *false*, as there is no *DataSourceOption* through which I can specify this field when running Hudi code.
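> For illustration, a minimal sketch of what such an option could look like on a Spark write; the *use_jdbc* key below is the hypothetical switch being asked for (it does not exist at the time of this report), while the other keys follow Hudi's existing *DataSourceOption* naming:
> {code:scala}
> // Minimal sketch: a Hudi write with Hive sync enabled via DataSource options.
> // The "use_jdbc" key is hypothetical here; there is no such option yet.
> import org.apache.spark.sql.{DataFrame, SaveMode}
>
> def writeWithHiveSync(df: DataFrame, basePath: String): Unit = {
>   df.write.format("org.apache.hudi")
>     .option("hoodie.table.name", "my_table")
>     .option("hoodie.datasource.write.recordkey.field", "id")
>     .option("hoodie.datasource.write.precombine.field", "ts")
>     .option("hoodie.datasource.hive_sync.enable", "true")
>     .option("hoodie.datasource.hive_sync.database", "default")
>     .option("hoodie.datasource.hive_sync.table", "my_table")
>     .option("hoodie.datasource.hive_sync.use_jdbc", "false") // hypothetical
>     .mode(SaveMode.Append)
>     .save(basePath)
> }
> {code}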
> Here is the failure:
> {noformat}
> java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.start(Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/session/SessionState;
>   at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:527)
>   at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:517)
>   at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:507)
>   at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272)
>   at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
>   at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
>   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
>   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
>   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229){noformat}
> I was expecting this to fail through Spark, because *hive-exec* is not shaded inside *hudi-spark-bundle*, while *HiveConf* is shaded and relocated. This *SessionState* comes from the spark-hive jar, and it obviously does not accept the relocated *HiveConf*.
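> The mismatch is easy to see from a spark-shell (a diagnostic sketch, assuming Hive classes are on the classpath): the JVM resolves *SessionState.start* by its full descriptor, including the parameter's package name, so listing the parameter types of the *start* overloads shows which *HiveConf* the classpath copy actually accepts:
> {code:scala}
> // Diagnostic sketch: print the fully-qualified parameter types of every
> // SessionState.start overload visible on the classpath. With spark-hive's
> // hive-exec this prints org.apache.hadoop.hive.conf.HiveConf, not the
> // relocated org.apache.hudi.org.apache.hadoop_hive.conf.HiveConf that
> // hudi-spark-bundle was compiled against, hence the NoSuchMethodError.
> Class.forName("org.apache.hadoop.hive.ql.session.SessionState")
>   .getMethods
>   .filter(_.getName == "start")
>   .foreach(m => println(m.getParameterTypes.map(_.getName).mkString(", ")))
> {code}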
> We in *EMR* are running into the same problem when trying to integrate with the Glue Catalog. For this we have to create the Hive metastore client through *Hive.get(conf).getMsc()* instead of how it is being done now, so that alternate implementations of the metastore can be created. However, because *hive-exec* is not shaded but *HiveConf* is relocated, we run into the same issues there.
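> A minimal sketch of that approach (assuming unshaded Hive classes on the classpath; note the accessor is spelled *getMSC()* in Hive's *ql* API):
> {code:scala}
> // Sketch: obtain the metastore client through Hive.get(conf) instead of
> // constructing HiveMetaStoreClient directly, so that alternate metastore
> // client implementations (e.g. the Glue Catalog client on EMR) can be
> // created, as described above.
> import org.apache.hadoop.hive.conf.HiveConf
> import org.apache.hadoop.hive.metastore.IMetaStoreClient
> import org.apache.hadoop.hive.ql.metadata.Hive
>
> val conf = new HiveConf()
> val client: IMetaStoreClient = Hive.get(conf).getMSC()
> println(client.getAllDatabases) // e.g. [default]
> {code}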
> It would not be recommended to shade *hive-exec* either, because it is itself an uber jar that shades a lot of things, and all of them would end up in the *hudi-spark-bundle* jar. We would not want to go down that route. That is why we would suggest considering removing any shading of Hive libraries.
> We can add a *Maven Profile* to shade, but that means it has to be activated by default; otherwise the default build will fail when *useJdbc* is set to *false*, and will fail again later when we commit the *Glue Catalog* changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
