spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <>
Subject [jira] [Created] (SPARK-20493) De-duplicate parse logics for DDL-like type string in R
Date Thu, 27 Apr 2017 15:11:04 GMT
Hyukjin Kwon created SPARK-20493:

             Summary: De-duplicate parse logics for DDL-like type string in R
                 Key: SPARK-20493
             Project: Spark
          Issue Type: Improvement
          Components: SparkR
    Affects Versions: 2.2.0
            Reporter: Hyukjin Kwon

It seems we are using SQLUtils.getSQLDataType[1] to parse the type string in structField, which
looks like a catalog string[2].

It looks like we can replace this with CatalystSqlParser.parseDataType[3].
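
As a rough illustration of the proposed replacement, the sketch below parses the same type string used later in this message with CatalystSqlParser.parseDataType. It assumes a spark-shell session (or any Scala REPL with spark-catalyst on the classpath); the names `parsed` and `expected` are only for illustration and are not taken from the SparkR code.

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.{DataType, StringType, StructField, StructType}

// Parse the DDL-like type string that the R side would pass to structField.
val parsed: DataType = CatalystSqlParser.parseDataType("struct<_1:string>")

// The same struct type built by hand, for comparison (illustrative only).
val expected: DataType =
  StructType(Seq(StructField("_1", StringType, nullable = true)))

// Print the catalog string of the parsed type and whether both types match.
println(parsed.catalogString)
println(parsed == expected)
```

If the two types compare equal, the catalog string accepted by structField and the DDL-like string accepted by the parser describe the same type, which is the premise of this issue.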

Both appear to accept similar DDL-like type definitions, as below:

scala> Seq(Tuple1(Tuple1("a"))).toDF.show()
| _1|

scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show()
| _1|

Such type strings look identical to the ones used in R, as below:

> write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet")
> collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>"))))
1      a

It seems R's handling is stricter because we check the types via regular expressions[4]
on the R side.

The actual logic there looks a bit different, but since we check the type string up front on the
R side, replacing it should not introduce any behaviour changes.

To make sure of this, dedicated tests were added in SPARK-20105.

