spark-issues mailing list archives

From "Ganesh Krishnan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-13795) ClassCast Exception while attempting to show() a DataFrame
Date Thu, 10 Mar 2016 22:29:40 GMT

     [ https://issues.apache.org/jira/browse/SPARK-13795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ganesh Krishnan updated SPARK-13795:
------------------------------------
    Description: 
The DataFrame schema, as printed by allDataJoined.printSchema(), is as follows:

{noformat}

 |-- eventType: string (nullable = true)
 |-- itemId: string (nullable = true)
 |-- productId: string (nullable = true)
 |-- productVersion: string (nullable = true)
 |-- servicedBy: string (nullable = true)
 |-- ACCOUNT_NAME: string (nullable = true)
 |-- CONTENTGROUPID: string (nullable = true)
 |-- PRODUCT_ID: string (nullable = true)
 |-- PROFILE_ID: string (nullable = true)
 |-- SALESADVISEREMAIL: string (nullable = true)
 |-- businessName: string (nullable = true)
 |-- contentGroupId: string (nullable = true)
 |-- salesAdviserName: string (nullable = true)
 |-- salesAdviserPhone: string (nullable = true)

{noformat}

There is NO column with any datatype other than String. There was previously an inferred column of type long, which was dropped:
{code}
DataFrame allDataJoined = whiteEventJoinedWithReference
        .drop(rliDataFrame.col("occurredAtDate"));
allDataJoined.printSchema(); // prints the schema shown above
allDataJoined.show();
{code}

The call to allDataJoined.show() throws the following exception:

{noformat}

java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
	at scala.math.Ordering$Int$.compare(Ordering.scala:256)
	at scala.math.Ordering$class.gt(Ordering.scala:97)
	at scala.math.Ordering$Int$.gt(Ordering.scala:256)
	at org.apache.spark.sql.catalyst.expressions.GreaterThan.nullSafeEval(predicates.scala:457)
	at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:383)
	at org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:238)
	at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
	at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$prunePartitions$1.apply(DataSourceStrategy.scala:257)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$prunePartitions$1.apply(DataSourceStrategy.scala:257)
	at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
	at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.prunePartitions(DataSourceStrategy.scala:257)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:82)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
	at org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.makeBroadcastHashJoin(SparkStrategies.scala:88)
	at org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.apply(SparkStrategies.scala:97)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
	at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
	at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:349)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:47)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:45)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:52)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:52)
	at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2134)
	at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1413)
	at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1495)
	at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:171)
	at org.apache.spark.sql.DataFrame.show(DataFrame.scala:394)
	at org.apache.spark.sql.DataFrame.show(DataFrame.scala:355)
	at org.apache.spark.sql.DataFrame.show(DataFrame.scala:363)

{noformat}
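The top of the trace suggests the root cause: during partition pruning, Scala's Int ordering unboxes the partition value as an Integer, but the value is actually a boxed Long. A self-contained Java illustration of that failing unbox (the value 20160100 is the partition value we observed while debugging):

{code}
public class UnboxIllustration {
    public static void main(String[] args) {
        // The partition value arrives as a boxed Long (bigint), but the
        // pruning predicate compares it with an Int ordering, which first
        // casts it to Integer -- exactly the cast that fails in the trace.
        Object partitionValue = Long.valueOf(20160100L);
        Integer unboxed = (Integer) partitionValue; // java.lang.ClassCastException
        System.out.println(unboxed);
    }
}
{code}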

Checked, googled, and searched Stack Overflow with no results.

Edit: I managed to narrow this bug down to the following use case: the raw JSON has the field dateOccuredAt, and the Parquet it is written to is also partitioned by dateOccuredAt. The raw JSON field is inferred as String, while the partition value is inferred as long (which is also correct). However, while persisting we get the above error even if the dateOccuredAt column is dropped from the DataFrame.

Also, we use Java and not Scala.
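For reference, a minimal Java sketch of the scenario, assuming Spark 1.6 in local mode with illustrative paths; the greater-than filter on the partition column is an assumption consistent with the GreaterThan predicate in the trace, not our exact pipeline:

{code}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class Spark13795Repro {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SPARK-13795-repro").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // The raw JSON has a dateOccuredAt field; schema inference types it as String.
        DataFrame raw = sqlContext.read().json("/path/to/raw.json");

        // Persist to Parquet partitioned by that same field; the partition
        // directory values (e.g. dateOccuredAt=20160100) come back inferred as long.
        raw.write().partitionBy("dateOccuredAt").parquet("/path/to/events.parquet");

        // Read back with a filter on the partition column (assumed here,
        // matching the GreaterThan evaluated during partition pruning).
        DataFrame events = sqlContext.read().parquet("/path/to/events.parquet");
        DataFrame filtered = events.filter(events.col("dateOccuredAt").gt("20160100"));

        // Even after dropping the conflicting column, show() still fails
        // with the ClassCastException above.
        DataFrame allDataJoined = filtered.drop(events.col("dateOccuredAt"));
        allDataJoined.show();
    }
}
{code}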


  was:
(Identical schema, code, and stack trace as in the current description above, but with this earlier closing note:)
Checked, googled, and searched Stack Overflow with no results. Somehow it was trying to cast the dropped column's value 20160100 to int (even though debugging showed it as a Long value: bigint).

Also, we use Java and not Scala.



> ClassCast Exception while attempting to show() a DataFrame
> ----------------------------------------------------------
>
>                 Key: SPARK-13795
>                 URL: https://issues.apache.org/jira/browse/SPARK-13795
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>         Environment: Linux 14.04 LTS
>            Reporter: Ganesh Krishnan
>




