spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Armbrust <mich...@databricks.com>
Subject Re: column identifiers in Spark SQL
Date Tue, 22 Sep 2015 17:58:36 GMT
Are you using a SQLContext or a HiveContext?  The programming guide
suggests the latter, as the former is really only there because some
applications may have conflicts with Hive dependencies.  SQLContext is case
sensitive by default where as the HiveContext is not.  The parser in
HiveContext is also a lot better.

On Tue, Sep 22, 2015 at 10:53 AM, Richard Hillegas <rhilleg@us.ibm.com>
wrote:

> I am puzzled by the behavior of column identifiers in Spark SQL. I don't
> find any guidance in the "Spark SQL and DataFrame Guide" at
> http://spark.apache.org/docs/latest/sql-programming-guide.html. I am
> seeing odd behavior related to case-sensitivity and to delimited (quoted)
> identifiers.
>
> Consider the following declaration of a table in the Derby relational
> database, whose dialect hews closely to the SQL Standard:
>
>    create table app.t( a int, "b" int, "c""d" int );
>
> Now let's load that table into Spark like this:
>
>   import org.apache.spark.sql._
>   import org.apache.spark.sql.types._
>
>   val df = sqlContext.read.format("jdbc").options(
>     Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
>     "dbtable" -> "app.t")).load()
>   df.registerTempTable("test_data")
>
> The following query runs fine because the column name matches the
> normalized form in which it is stored in the metadata catalogs of the
> relational database:
>
>   // normalized column names are recognized
>   sqlContext.sql(s"""select A from test_data""").show
>
> But the following query fails during name resolution. This puzzles me
> because non-delimited identifiers are case-insensitive in the ANSI/ISO
> Standard. They are also supposed to be case-insensitive in HiveQL, at least
> according to section 2.3.1 of the QuotedIdentifier.html webpage attached to
> https://issues.apache.org/jira/browse/HIVE-6013:
>
>   // ...unnormalized column names raise this error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input
> columns A, b, c"d;
>   sqlContext.sql("""select a from test_data""").show
>
> Delimited (quoted) identifiers are treated as string literals. Again,
> non-Standard behavior:
>
>   // this returns rows consisting of the string literal "b"
>   sqlContext.sql("""select "b" from test_data""").show
>
> Embedded quotes in delimited identifiers won't even parse:
>
>   // embedded quotes raise this error: java.lang.RuntimeException: [1.11]
> failure: ``union'' expected but "d" found
>   sqlContext.sql("""select "c""d" from test_data""").show
>
> This behavior is non-Standard and it strikes me as hard to describe to
> users concisely. Would the community support an effort to bring the
> handling of column identifiers into closer conformance with the Standard?
> Would backward compatibility concerns even allow us to do that?
>
> Thanks,
> -Rick
>

Mime
View raw message