spark-user mailing list archives

From Mohammed Guller <moham...@glassbeam.com>
Subject RE: Spark SQL parser bug?
Date Fri, 10 Oct 2014 18:08:41 GMT
Hi Chen,
Thanks for looking into this.

It looks like the bug may be in the Spark Cassandra connector code. Table x is a table in
Cassandra.

However, while trying to troubleshoot this issue, I noticed another one. This time I did
not use Cassandra; instead, I created a table on the fly. I am not seeing the same error, but
the results do not look right. Here is my complete Spark shell session:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
14/10/10 11:05:11 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1;
using 192.168.59.135 instead (on interface eth0)
14/10/10 11:05:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
14/10/10 11:05:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
Spark context available as sc.

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> val sqlContext = new SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@2be5c74d

scala> import sqlContext.createSchemaRDD
import sqlContext.createSchemaRDD

scala> case class X(a: Int, ts: java.sql.Timestamp)
defined class X

scala> val rdd = sc.parallelize( 1 to 5).map{ n => X(n, new java.sql.Timestamp(1325548800000L
+ n*86400000))}
rdd: org.apache.spark.rdd.RDD[X] = MappedRDD[1] at map at <console>:20

scala> rdd.collect
res0: Array[X] = Array(X(1,2012-01-03 16:00:00.0), X(2,2012-01-04 16:00:00.0), X(3,2012-01-05
16:00:00.0), X(4,2012-01-06 16:00:00.0), X(5,2012-01-07 16:00:00.0))

scala> rdd.registerTempTable("x")

scala> val sRdd = sqlContext.sql("select a from x where ts >= '2012-01-01T00:00:00';")
sRdd: org.apache.spark.sql.SchemaRDD =
SchemaRDD[4] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [a#0]
ExistingRdd [a#0,ts#1], MapPartitionsRDD[6] at mapPartitions at basicOperators.scala:208

scala> sRdd.collect
res2: Array[org.apache.spark.sql.Row] = Array()
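
A quick check in plain Scala, independent of Spark, of which rows the predicate should keep. This is only a sketch: ExpectedRows and its members are illustrative names, and reading the literal as UTC is an assumption (the session does not show which time zone is applied). The earliest row is about three days after the bound, so a different time zone would not change the outcome.

```scala
object ExpectedRows {
  // Same base instant and one-day step as the test data in the session.
  val baseMillis = 1325548800000L
  val dayMillis = 86400000L

  // 2012-01-01T00:00:00 read as UTC. This is an assumption; the session
  // does not say which time zone is applied to the literal.
  val lowerBoundMillis = 1325376000000L

  // Values of column a whose ts is at or after the lower bound.
  def matching: Seq[Int] =
    (1 to 5).filter(n => baseMillis + n * dayMillis >= lowerBoundMillis)
}
```

ExpectedRows.matching evaluates to all five values of a, which is why the empty array returned by the query does not look right.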



Mohammed

From: Cheng Lian [mailto:lian.cs.zju@gmail.com]
Sent: Friday, October 10, 2014 4:37 AM
To: Mohammed Guller; user@spark.apache.org
Subject: Re: Spark SQL parser bug?


Hi Mohammed,

Would you mind sharing the DDL of table x and the complete stack trace of the exception
you got? A full Spark shell session history would be more than helpful. PR #2084 was
merged into master in August, and the timestamp type is supported in 1.1.

I tried the following snippets in Spark shell (v1.1), and didn’t observe this issue:

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> import sc._
import sc._

scala> val sqlContext = new SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@6c3441c5

scala> import sqlContext._
import sqlContext._

scala> case class Record(a: Int, ts: java.sql.Timestamp)
defined class Record

scala> makeRDD(Seq.empty[Record], 1).registerTempTable("x")

scala> sql("SELECT a FROM x WHERE ts >= '2012-01-01T00:00:00' AND ts <= '2012-03-31T23:59:59'")
res1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[3] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [a#0]
 ExistingRdd [a#0,ts#1], MapPartitionsRDD[5] at mapPartitions at basicOperators.scala:208

scala> res1.collect()
...
res2: Array[org.apache.spark.sql.Row] = Array()

Cheng

On 10/9/14 10:26 AM, Mohammed Guller wrote:
Hi –

When I run the following Spark SQL query in the Spark shell (version 1.1.0):

val rdd = sqlContext.sql("SELECT a FROM x WHERE ts >= '2012-01-01T00:00:00' AND ts <=
'2012-03-31T23:59:59' ")

it gives the following error:
rdd: org.apache.spark.sql.SchemaRDD =
SchemaRDD[294] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
java.util.NoSuchElementException: head of empty list

The ts column in the where clause has timestamp data and is of type timestamp. If I replace
the string '2012-01-01T00:00:00' in the where clause with its epoch value, then the query
works fine.
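
For reference, a minimal sketch of how an epoch value for such a literal can be computed in plain Scala. It assumes the literal is meant as UTC and that the value is wanted in milliseconds; neither is stated above. EpochLiteral and toEpochMillis are hypothetical names used only for illustration.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

object EpochLiteral {
  // Parses a literal such as 2012-01-01T00:00:00 as UTC (an assumption)
  // and returns milliseconds since the Unix epoch.
  def toEpochMillis(literal: String): Long = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    fmt.setLenient(false) // reject out-of-range fields instead of rolling them over
    fmt.parse(literal).getTime
  }
}
```

Under these assumptions, EpochLiteral.toEpochMillis("2012-01-01T00:00:00") gives 1325376000000L; divide by 1000 if the comparison is in seconds.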

It looks like I have run into the issue described in this pull request: https://github.com/apache/spark/pull/2084

Is that PR not merged in Spark version 1.1.0? Or am I missing something?

Thanks,
Mohammed


​