spark-issues mailing list archives

From "Felix Cheung (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-21450) List of NA is flattened inside a SparkR struct type
Date Tue, 18 Jul 2017 05:28:03 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091132#comment-16091132 ]

Felix Cheung commented on SPARK-21450:
--------------------------------------

[~hyukjin.kwon] 

If you follow the code in test_sparkSQL.R, 
{code}
df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
schema2 <- structType(structField("date", "date"))
s <- collect(select(df, from_json(df$col, schema2)))
expect_equal(s[[1]][[1]], NA)
s <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/yyyy")))
{code}

Both from_json lines should be using schema2, not schema. In test_sparkSQL.R, schema is actually defined as
{code}
schema <- structType(structField("age", "integer"),
                     structField("height", "double"))
{code}
which doesn't match the JSON blob.

Is this a copy/paste error in this JIRA? Could you check?

In any case, I wonder (I didn't get to test it in Scala) whether the different result is caused
by the JSON blob being unparseable with the schema/format passed in. The logi NA would be a null in Scala.
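
As a rough illustration of the parsing question above, here is a minimal base R sketch. It does not call SparkR or Spark's JSON parser. It assumes the default date pattern is yyyy-MM-dd (written as "%Y-%m-%d" in R), and parse_or_na is a hypothetical helper introduced only for this example. It shows how the same string can come back as NA under one pattern and as a Date under another.

```r
# Hypothetical helper (not part of SparkR): parse a date string with a
# given pattern. as.Date() returns NA when the string does not match.
parse_or_na <- function(x, pattern) as.Date(x, format = pattern)

raw <- "21/10/2014"

# Assumed default pattern: the string does not match it, so this is NA.
default_result <- parse_or_na(raw, "%Y-%m-%d")

# Explicit pattern matching the data: the string parses to a Date.
explicit_result <- parse_or_na(raw, "%d/%m/%Y")

print(is.na(default_result))
print(format(explicit_result))
```

If Spark behaves analogously, a blob that fails to parse would surface as a single null (NA in R) rather than a struct with a null field, which would be consistent with the two shapes reported in this JIRA. That is an assumption to verify against Scala, not a confirmed explanation.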

> List of NA is flattened inside a SparkR struct type
> ---------------------------------------------------
>
>                 Key: SPARK-21450
>                 URL: https://issues.apache.org/jira/browse/SPARK-21450
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.2.0
>            Reporter: Hossein Falaki
>
> Consider the following two cases copied from {{test_sparkSQL.R}}:
> {code}
> df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
> schema <- structType(structField("date", "date"))
> s1 <- collect(select(df, from_json(df$col, schema)))
> s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/yyyy")))
> {code}
> If you inspect s1 using {{str(s1)}} you will find:
> {code}
> 'data.frame':	2 obs. of  1 variable:
>  $ jsontostructs(col):List of 2
>   ..$ : logi NA
> {code}
> But for s2, running {{str(s2)}} results in:
> {code}
> 'data.frame':	2 obs. of  1 variable:
>  $ jsontostructs(col):List of 2
>   ..$ :List of 1
>   .. ..$ date: Date, format: "2014-10-21"
>   .. ..- attr(*, "class")= chr "struct"
> {code}
> I assume this is not intentional and is just a subtle bug. Do you think otherwise? [~shivaram] and [~felixcheung]


