hudi-commits mailing list archives

From "Cosmin Iordache (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (HUDI-83) Map Timestamp type in spark to corresponding Timestamp type in Hive during Hive sync
Date Tue, 17 Mar 2020 13:41:00 GMT

    [ https://issues.apache.org/jira/browse/HUDI-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060915#comment-17060915 ]

Cosmin Iordache edited comment on HUDI-83 at 3/17/20, 1:40 PM:
---------------------------------------------------------------

I'm going to add some more info here regarding backward compatibility.
Continuing the investigation: I created a dataset with the previous versions (Spark 2.3.3, Hudi 0.5.0). The relevant columns are TimestampType fields in the DataFrame's StructType schema:
{code:java}
15:28:22.469 [DataLakeSystem-akka.actor.default-dispatcher-4] INFO org.apache.hudi.HoodieSparkSqlWriter$ - Registered avro schema : {
  "type" : "record",
  "name" : "test_record",
  "namespace" : "hoodie.test",
  "fields" : [
    ...
    { "name" : "other_date", "type" : [ "long", "null" ] },
    { "name" : "timestamp_1", "type" : [ "long", "null" ] },
    ...
  ]
}
{code}
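For context, here is a minimal sketch of the kind of write that produced the schema above (Spark 2.3.3 / Hudi 0.5.0, spark-shell). The table name, key fields and path are illustrative, not the actual job config:
{code:java}
// Minimal write sketch; names and paths are illustrative, not the real pipeline's.
import java.sql.Timestamp
import org.apache.spark.sql.SaveMode
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig
import spark.implicits._

// other_date and timestamp_1 are TimestampType in the DataFrame's StructType schema.
val df = Seq(
  (1, Timestamp.valueOf("2020-03-16 15:28:22"), Timestamp.valueOf("2020-03-16 15:28:22"))
).toDF("id", "other_date", "timestamp_1")

df.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "test_record")
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "timestamp_1")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "id") // keep it simple: partition by the key
  .mode(SaveMode.Overwrite)
  .save("/tmp/hudi/test_record")
{code}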
So on 0.5.0 the timestamp columns are saved as plain longs (as expected).
I then ran a new upsert with the same DataFrame on Spark 2.4.4 / Hudi 0.5.1, which registered this schema:
{code:java}
{
  "type" : "record",
  "name" : "test_record",
  "namespace" : "hoodie.test",
  "fields" : [
    ...
    {
      "name" : "other_date",
      "type" : [ { "type" : "long", "logicalType" : "timestamp-micros" }, "null" ]
    },
    {
      "name" : "timestamp_1",
      "type" : [ { "type" : "long", "logicalType" : "timestamp-micros" }, "null" ]
    },
    ...
  ]
}
{code}
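For completeness, the upsert that produced the schema above was just a second write of the same DataFrame on the upgraded stack; a sketch under the same illustrative names:
{code:java}
// Upsert of the same DataFrame on Spark 2.4.4 / Hudi 0.5.1; options/path illustrative as above.
// The spark-avro conversion in 0.5.1 annotates the long with logicalType=timestamp-micros,
// which is where the new writer schema comes from.
df.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "test_record")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "timestamp_1")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "id")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/test_record")
{code}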
Ingestion worked, but reading then fails:
{code:java}
scala> q1.select("other_date","timestamp_1").show()
53277 [Executor task launch worker for task 3] ERROR org.apache.spark.executor.Executor - Exception in task 1.0 in stage 2.0 (TID 3)
org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file [hdfs://namenode:8020/data/lake/3aa1f8f7-63a4-4aa4-986e-0ed835601999/converted/io_yields_hashkey/-2/8bddbe9e-d0d5-492a-b96d-9b196322a317-0_0-49-79_20200316152824.parquet]. Column: [other_date], Expected: timestamp, Found: INT64
 at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
 at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:636)
 at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:255)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:836)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:836)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:123)
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
 Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
 at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:250)
 at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:440)
 at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:208)
 at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
 at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
 at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
 at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
 at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
 ... 22 more
{code}
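So the older base files physically store plain INT64 with no logical type, while the evolved table schema now declares timestamp-micros, and Spark's vectorized Parquet reader refuses that conversion. A quick way to confirm what an individual file stores is to read it directly and let Spark infer the schema from the Parquet footer (path illustrative; substitute the exact file from the error above):
{code:java}
// With no user-supplied schema, Spark infers it from the Parquet footer, so a
// pre-upgrade file should report other_date/timestamp_1 as "long", while a
// post-upgrade file reports "timestamp".
val oneFile = spark.read.parquet(
  "hdfs://namenode:8020/data/lake/.../8bddbe9e-d0d5-492a-b96d-9b196322a317-0_0-49-79_20200316152824.parquet")
oneFile.printSchema()
{code}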
 


> Map Timestamp type in spark to corresponding Timestamp type in Hive during Hive sync
> ------------------------------------------------------------------------------------
>
>                 Key: HUDI-83
>                 URL: https://issues.apache.org/jira/browse/HUDI-83
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>          Components: Hive Integration, Usability
>            Reporter: Vinoth Chandar
>            Priority: Major
>             Fix For: 0.6.0
>
>
> [https://github.com/apache/incubator-hudi/issues/543] & related issues 



