spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-6082) SparkSQL should fail gracefully when input data format doesn't match expectations
Date Sun, 01 Mar 2015 09:39:04 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342094#comment-14342094 ]

Apache Spark commented on SPARK-6082:
-------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/4842

> SparkSQL should fail gracefully when input data format doesn't match expectations
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-6082
>                 URL: https://issues.apache.org/jira/browse/SPARK-6082
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.1
>            Reporter: Kay Ousterhout
>
> I have a udf that creates a tab-delimited table. If any of the column values contains
> a tab, SQL fails with an ArrayIndexOutOfBoundsException (pasted below).  It would be great
> if SQL failed gracefully here, with a helpful exception (something like "One row contained
> too many values").
> It looks like this can be done quite easily, by checking here if i > columnBuilders.size
> and if so, throwing a nicer exception: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala#L124.
> One thing that makes this problem especially annoying to debug is that if you do "CREATE
> table foo as select transform(..." and then "CACHE table foo", it works fine.  It only fails
> if you do "CACHE table foo as select transform(...".  Because of this, it would be great if
> the problem were more transparent to users.
> Stack trace:
> java.lang.ArrayIndexOutOfBoundsException: 3
>   at org.apache.spark.sql.columnar.InMemoryRelation$anonfun$3$anon$1.next(InMemoryColumnarTableScan.scala:125)
>   at org.apache.spark.sql.columnar.InMemoryRelation$anonfun$3$anon$1.next(InMemoryColumnarTableScan.scala:112)
>   at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
>   at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:220)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
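The failure mode described above — a tab-delimited row whose value contains an embedded tab splitting into more fields than the schema has columns — can be illustrated with a small simulation. This is plain Python, not Spark's actual Scala code; the column names and the `load_row` helper are hypothetical, and the check stands in for the bounds check proposed against `columnBuilders`:

```python
# A 3-column schema for a tab-delimited table.
columns = ["name", "city", "score"]

good_row = "alice\tNYC\t42"
bad_row = "bob\tha\ts\t7"  # the second value itself contains a tab


def load_row(line, columns):
    """Split a tab-delimited line; fail with a helpful message on mismatch."""
    fields = line.split("\t")
    if len(fields) != len(columns):
        # Fail gracefully with a descriptive error, instead of letting an
        # out-of-bounds index surface deep inside the columnar builder.
        raise ValueError(
            f"Row has {len(fields)} values but the table has "
            f"{len(columns)} columns: {line!r}"
        )
    return dict(zip(columns, fields))


print(load_row(good_row, columns))
# {'name': 'alice', 'city': 'NYC', 'score': '42'}
```

Calling `load_row(bad_row, columns)` raises `ValueError` with the row's contents, which is the kind of actionable message the issue asks for in place of `ArrayIndexOutOfBoundsException: 3`.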



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

