spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Davies Liu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-6360) For Spark 1.1 and 1.2, after any RDD transformations, calling saveAsParquetFile over a SchemaRDD with decimal or UDT column throws
Date Thu, 06 Aug 2015 23:22:10 GMT

     [ https://issues.apache.org/jira/browse/SPARK-6360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Davies Liu updated SPARK-6360:
------------------------------
    Target Version/s:   (was: 1.2.3)

> For Spark 1.1 and 1.2, after any RDD transformations, calling saveAsParquetFile over
a SchemaRDD with decimal or UDT column throws
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6360
>                 URL: https://issues.apache.org/jira/browse/SPARK-6360
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: Cheng Lian
>             Fix For: 1.5.0
>
>
> Spark shell session for reproduction (use {{:paste}}):
> {noformat}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.catalyst.types.decimal._
> import org.apache.spark.sql.catalyst.types._
> import org.apache.hadoop.fs._
> val sqlContext = new SQLContext(sc)
> val fs = FileSystem.get(sc.hadoopConfiguration)
> fs.delete(new Path("a.parquet"))
> fs.delete(new Path("b.parquet"))
> import sc._
> import sqlContext._
> val r1 = parallelize(1 to 10).map(i => Tuple1(Decimal(i, 10, 0))).select('_1 cast
DecimalType(10, 0))
> // OK
> r1.saveAsParquetFile("a.parquet")
> val r2 = parallelize(1 to 10).map(i => Tuple1(Decimal(i, 10, 0))).select('_1 cast
DecimalType(10, 0))
> val r3 = r2.coalesce(1)
> // Error
> r3.saveAsParquetFile("b.parquet")
> {noformat}
> Exception thrown:
> {noformat}
> java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to org.apache.spark.sql.catalyst.types.decimal.Decimal
>         at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359)
>         at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:328)
>         at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:314)
>         at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>         at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>         at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>         at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:308)
>         at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
>         at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>         at org.apache.spark.scheduler.Task.run(Task.scala:56)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 15/03/17 00:04:13 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost):
java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to org.apache.spark.sql.catalyst.types.decimal.Decimal
>         at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359)
>         at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:328)
>         at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:314)
>         at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>         at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>         at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>         at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:308)
>         at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
>         at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>         at org.apache.spark.scheduler.Task.run(Task.scala:56)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The query plan of {{r1}} is:
> {noformat}
> == Parsed Logical Plan ==
> 'Project [CAST('_1, DecimalType(10,0)) AS c0#60]
>  LogicalRDD [_1#59], MapPartitionsRDD[71] at mapPartitions at ExistingRDD.scala:36
> == Analyzed Logical Plan ==
> Project [CAST(_1#59, DecimalType(10,0)) AS c0#60]
>  LogicalRDD [_1#59], MapPartitionsRDD[71] at mapPartitions at ExistingRDD.scala:36
> == Optimized Logical Plan ==
> Project [CAST(_1#59, DecimalType(10,0)) AS c0#60]
>  LogicalRDD [_1#59], MapPartitionsRDD[71] at mapPartitions at ExistingRDD.scala:36
> == Physical Plan ==
> Project [CAST(_1#59, DecimalType(10,0)) AS c0#60]
>  PhysicalRDD [_1#59], MapPartitionsRDD[71] at mapPartitions at ExistingRDD.scala:36
> Code Generation: false
> == RDD ==
> {noformat}
> while {{r3}}'s query plan is:
> {noformat}
> == Parsed Logical Plan ==
> LogicalRDD [c0#61], CoalescedRDD[74] at coalesce at SchemaRDD.scala:456
> == Analyzed Logical Plan ==
> LogicalRDD [c0#61], CoalescedRDD[74] at coalesce at SchemaRDD.scala:456
> == Optimized Logical Plan ==
> LogicalRDD [c0#61], CoalescedRDD[74] at coalesce at SchemaRDD.scala:456
> == Physical Plan ==
> PhysicalRDD [c0#61], CoalescedRDD[74] at coalesce at SchemaRDD.scala:456
> Code Generation: false
> == RDD ==
> {noformat}
> The key difference here is that, {{r3}} wraps an existing {{SchemaRDD}} ({{r2}}, beneath
the {{CoalescedRDD}}). While evaluating {{r3}}, {{r2.compute}} is called, which calls {{ScalaReflection.convertRowToScala}}.
Here, Catalyst {{Decimal}} values are converted into Java {{BigDecimal}}s, and finally causes
the exception.
> Note that {{DataFrame}} in Spark 1.3 doesn't suffer this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message