spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-20493) De-duplicate parsing logic for DDL-like type strings in R
Date Thu, 27 Apr 2017 15:11:04 GMT
Hyukjin Kwon created SPARK-20493:
------------------------------------

             Summary: De-duplicate parsing logic for DDL-like type strings in R
                 Key: SPARK-20493
                 URL: https://issues.apache.org/jira/browse/SPARK-20493
             Project: Spark
          Issue Type: Improvement
          Components: SparkR
    Affects Versions: 2.2.0
            Reporter: Hyukjin Kwon


It seems we are using SQLUtils.getSQLDataType[1] to parse the type string in structField, which
looks like a catalog string[2].

It looks like we can replace this with CatalystSqlParser.parseDataType[3].

They look like similar DDL-like type definitions, as below:

{code}
scala> Seq(Tuple1(Tuple1("a"))).toDF.show()
+---+
| _1|
+---+
|[a]|
+---+
{code}

{code}
scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show()
+---+
| _1|
+---+
|[a]|
+---+
{code}

Such type strings look identical to the ones in R, as below:

{code}
> write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet")
> collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>"))))
  struct
1      a
{code}

It seems R’s one is stricter because we check the types via regular expressions[4]
on the R side.

The actual logic there looks a bit different, but as we check the type string ahead of time on
the R side, it looks like replacing it would not introduce any behaviour changes.

To make sure of this, dedicated tests were added in SPARK-20105.
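
For reference, the proposed replacement can be tried directly. The following is only a minimal
sketch, assuming a spark-shell session (so Spark SQL is on the classpath); the value names
`parsed` and `expected` are illustrative and not part of any patch. It parses the same type
string used in the R example above and compares the result with a hand-built schema:

{code}
// Assumption: run inside spark-shell, where the Spark SQL classes are available.
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types._

// Parse the DDL-like type string used in the structField example.
val parsed = CatalystSqlParser.parseDataType("struct<_1:string>")

// Hand-built equivalent: a struct with one nullable string field named "_1".
val expected = StructType(Seq(StructField("_1", StringType)))

// The parsed type should match the hand-built one, and its catalog string
// should round-trip to the input.
assert(parsed == expected)
assert(parsed.catalogString == "struct<_1:string>")
{code}

If both assertions hold, the catalog string and the parser agree on this input, which is the
property the replacement relies on.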

[1] https://github.com/apache/spark/blob/d1f6c64c4b763c05d6d79ae5497f298dc3835f3e/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L93-L131
[2] https://github.com/apache/spark/blob/95ec4e25bb65f37f80222ffe70a95993a9149f80/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L66-L67
[3] https://github.com/apache/spark/blob/1472cac4bb31c1886f82830778d34c4dd9030d7a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala#L36-L40
[4] https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/R/pkg/R/schema.R#L129-L187




