spark-issues mailing list archives

From "Luciano Resende (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (SPARK-6666) org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names
Date Tue, 08 Mar 2016 00:25:41 GMT

     [ https://issues.apache.org/jira/browse/SPARK-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luciano Resende closed SPARK-6666.
----------------------------------
    Resolution: Cannot Reproduce

I have tried the scenarios above in Spark trunk using both Postgres and DB2, see:
https://github.com/lresende/spark-sandbox/blob/master/src/main/scala/com/luck/sql/JDBCApplication.scala

The described issues no longer seem reproducible; the full results are shown below.

root
 |-- Symbol: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sector: string (nullable = true)
 |-- Price: double (nullable = true)
 |-- Dividend Yield: double (nullable = true)
 |-- Price/Earnings: double (nullable = true)
 |-- Earnings/Share: double (nullable = true)
 |-- Book Value: double (nullable = true)
 |-- 52 week low: double (nullable = true)
 |-- 52 week high: double (nullable = true)
 |-- Market Cap: double (nullable = true)
 |-- EBITDA: double (nullable = true)
 |-- Price/Sales: double (nullable = true)
 |-- Price/Book: double (nullable = true)
 |-- SEC Filings: string (nullable = true)

+------+------+------+-----+--------------+--------------+--------------+----------+-----------+------------+----------+------+-----------+----------+-----------+
|Symbol|  Name|Sector|Price|Dividend Yield|Price/Earnings|Earnings/Share|Book Value|52 week low|52 week high|Market Cap|EBITDA|Price/Sales|Price/Book|SEC Filings|
+------+------+------+-----+--------------+--------------+--------------+----------+-----------+------------+----------+------+-----------+----------+-----------+
|    S1|Name 1| Sec 1| 10.0|          10.0|          10.0|          10.0|      10.0|       10.0|        10.0|      10.0|  10.0|       10.0|      10.0|        100|
|    s2|Name 2| Sec 2| 20.0|          20.0|          20.0|          20.0|      20.0|       20.0|        20.0|      20.0|  20.0|       20.0|      20.0|        200|
+------+------+------+-----+--------------+--------------+--------------+----------+-----------+------------+----------+------+-----------+----------+-----------+

+------+
|AvgCPI|
+------+
|  15.0|
+------+
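As a quick sanity check on the AvgCPI result above, independent of Spark: the average of the two Earnings/Share values shown in the table, 10.0 and 20.0, is indeed 15.0.

```scala
// Sanity check: AVG(`Earnings/Share`) over the two rows shown (10.0 and 20.0)
// should equal the plain arithmetic mean of those values.
object AvgCheck extends App {
  val earningsPerShare = Seq(10.0, 20.0)
  val avgCPI = earningsPerShare.sum / earningsPerShare.size
  println(avgCPI) // 15.0, matching the AvgCPI column above
}
```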

+------+------+------+-----+--------------+--------------+--------------+----------+-----------+------------+----------+------+-----------+----------+-----------+
|Symbol|  Name|Sector|Price|Dividend Yield|Price/Earnings|Earnings/Share|Book Value|52 week low|52 week high|Market Cap|EBITDA|Price/Sales|Price/Book|SEC Filings|
+------+------+------+-----+--------------+--------------+--------------+----------+-----------+------------+----------+------+-----------+----------+-----------+
|    S1|Name 1| Sec 1| 10.0|          10.0|          10.0|          10.0|      10.0|       10.0|        10.0|      10.0|  10.0|       10.0|      10.0|        100|
|    s2|Name 2| Sec 2| 20.0|          20.0|          20.0|          20.0|      20.0|       20.0|        20.0|      20.0|  20.0|       20.0|      20.0|        200|
+------+------+------+-----+--------------+--------------+--------------+----------+-----------+------------+----------+------+-----------+----------+-----------+



> org.apache.spark.sql.jdbc.JDBCRDD  does not escape/quote column names
> ---------------------------------------------------------------------
>
>                 Key: SPARK-6666
>                 URL: https://issues.apache.org/jira/browse/SPARK-6666
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>         Environment:  
>            Reporter: John Ferguson
>            Priority: Critical
>
> Is there a way to have JDBC DataFrames use quoted/escaped column names? Right now, it
> looks like it "sees" the names correctly in the schema it creates, but it does not escape
> them in the SQL it generates when they are not compliant:
> org.apache.spark.sql.jdbc.JDBCRDD
> ....
> private val columnList: String = {
>   val sb = new StringBuilder()
>   columns.foreach(x => sb.append(",").append(x))
>   if (sb.length == 0) "1" else sb.substring(1)
> }
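For context, the kind of fix the reporter suggests could be sketched as below. This is an illustrative sketch only, assuming ANSI double-quote quoting; `ColumnQuoting` is a hypothetical name and not Spark's actual code (the eventual fix routed quoting through dialect-specific handling).

```scala
// Hypothetical sketch: quote each identifier before building the SELECT list,
// escaping embedded double quotes per the SQL standard. Not Spark's actual fix.
object ColumnQuoting {
  // "52 week low" -> "\"52 week low\""; embedded quotes are doubled.
  def quoteIdentifier(name: String): String =
    "\"" + name.replace("\"", "\"\"") + "\""

  // Same shape as JDBCRDD's columnList, but with quoting applied to each column.
  def columnList(columns: Seq[String]): String =
    if (columns.isEmpty) "1" else columns.map(quoteIdentifier).mkString(",")
}
```

With quoting applied, a generated `SELECT` would preserve mixed case and spaces instead of letting the database fold the names to lowercase.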
> If you see value in this, I would take a shot at adding the quoting (escaping) of column
> names here. If you don't, some drivers... like postgresql's will simply lower-case all
> names when parsing the query. As you can see in the TL;DR below, that means they won't
> match the schema I am given.
> TL;DR:
> -------- 
> I am able to connect to a Postgres database in the shell (with the driver referenced):
>    val jdbcDf = sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser", "sp500")
> In fact when I run:
>    jdbcDf.registerTempTable("sp500")
>    val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI FROM sp500")
> and
>    val avgEPSProg = jdbcDf.agg(avg(jdbcDf.col("Earnings/Share")))
> The values come back as expected.  However, if I try:
>    jdbcDf.show
> Or if I try
>    
>    val all = sqlContext.sql("SELECT * FROM sp500")
>    all.show
> I get errors about column names not being found. In fact, the error mentions the column
> names all lower-cased. For now I will change my schema to be more restrictive. Right now
> it is, per a Stack Overflow poster, not ANSI compliant: it does things that are allowed
> by ""-quoting in pgsql, MySQL, and SQL Server. BTW, our users are giving us tables like
> this because various tools they already use support non-compliant names. In fact, this is
> mild compared to what we've had to support.
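Since the delimiter character differs across the databases mentioned above, any quoting fix would need per-database handling. A rough sketch, where `Delimiters` is a hypothetical helper and the delimiter rules are assumptions for illustration:

```scala
// Hypothetical sketch: identifier delimiters differ per database, so quoting
// must be dialect-aware. Delimiter rules here are assumed for illustration.
object Delimiters {
  def quote(db: String, name: String): String = db match {
    case "postgresql" => "\"" + name.replace("\"", "\"\"") + "\"" // ANSI double quotes
    case "mysql"      => "`" + name.replace("`", "``") + "`"      // backticks
    case "sqlserver"  => "[" + name.replace("]", "]]") + "]"      // bracketed identifiers
    case _            => name                                     // unknown: pass through
  }
}
```

This mirrors why a single hard-coded quote character would not be enough for a general JDBC data source.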
> Currently the schema in question uses mixed-case, quoted names with special characters
> and spaces:
> CREATE TABLE sp500
> (
> "Symbol" text,
> "Name" text,
> "Sector" text,
> "Price" double precision,
> "Dividend Yield" double precision,
> "Price/Earnings" double precision,
> "Earnings/Share" double precision,
> "Book Value" double precision,
> "52 week low" double precision,
> "52 week high" double precision,
> "Market Cap" double precision,
> "EBITDA" double precision,
> "Price/Sales" double precision,
> "Price/Book" double precision,
> "SEC Filings" text
> ) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

