spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-20493) De-duplicate parse logic for DDL-like type strings in R
Date Thu, 27 Apr 2017 15:27:04 GMT

    [ https://issues.apache.org/jira/browse/SPARK-20493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986818#comment-15986818 ]

Apache Spark commented on SPARK-20493:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17785

> De-duplicate parse logic for DDL-like type strings in R
> -------------------------------------------------------
>
>                 Key: SPARK-20493
>                 URL: https://issues.apache.org/jira/browse/SPARK-20493
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>    Affects Versions: 2.2.0
>            Reporter: Hyukjin Kwon
>
> It seems we are using SQLUtils.getSQLDataType[1] to parse the type string in structField.
> It looks like we can replace this with CatalystSqlParser.parseDataType[2].
> Both accept similar DDL-like type definitions, as below:
> {code}
> scala> Seq(Tuple1(Tuple1("a"))).toDF.show()
> +---+
> | _1|
> +---+
> |[a]|
> +---+
> {code}
> {code}
> scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show()
> +---+
> | _1|
> +---+
> |[a]|
> +---+
> {code}
> Such type strings look identical to the ones accepted on the R side, as below:
> {code}
> > write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet")
> > collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>"))))
>   struct
> 1      a
> {code}
> It seems the R side is stricter, because we validate the types via regular expressions[3] in R.
> The actual parsing logic differs slightly, but since the type string is checked ahead of time on the R side, replacing the JVM-side parser should not introduce any behaviour changes.
> To make sure of this, dedicated tests were added in SPARK-20105.
> [1] https://github.com/apache/spark/blob/d1f6c64c4b763c05d6d79ae5497f298dc3835f3e/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L93-L131
> [2] https://github.com/apache/spark/blob/1472cac4bb31c1886f82830778d34c4dd9030d7a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala#L36-L40
> [3] https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/R/pkg/R/schema.R#L129-L187
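For context, the R-side check referenced in [3] is regex-based. Below is a hypothetical, self-contained Scala sketch of that kind of validation for flat (non-nested) struct type strings; it is not Spark's actual code, and the real patterns in R/pkg/R/schema.R differ in detail:

```scala
// Hypothetical sketch of regex-based type-string validation, in the spirit of
// the checks in R/pkg/R/schema.R. Handles only primitives and one level of
// struct<name:type,...>; the real R patterns are more elaborate.
object TypeStringCheck {
  private val primitive = Set("string", "integer", "double", "boolean")
  private val StructRe = """struct<(.+)>""".r

  def isValid(s: String): Boolean = s match {
    case StructRe(fields) =>
      // Each field must look like "name:type" with a valid primitive type.
      fields.split(",").forall { f =>
        f.split(":") match {
          case Array(_, t) => primitive.contains(t.trim)
          case _           => false
        }
      }
    case other => primitive.contains(other)
  }
}

println(TypeStringCheck.isValid("struct<_1:string>")) // true
println(TypeStringCheck.isValid("struct<_1:strin>"))  // false
```

A check like this rejects malformed type strings before they ever reach the JVM parser, which is why replacing getSQLDataType with CatalystSqlParser.parseDataType underneath should be observable only for strings that were already invalid.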



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

