From icexelloss <...@git.apache.org>
Subject [GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...
Date Fri, 25 May 2018 14:13:02 GMT
GitHub user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    @rxin @gatorsmile thanks for joining the discussion!
    
    On the configuration side, we already have a mechanism to do this for the "timezone" config:
    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala#L48
    I'd imagine we could extend this mechanism to support an arbitrary configuration map.
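
    For illustration, here is a minimal sketch of what the user-facing side could look like. The timezone conf below is real; the column-matching conf name is purely hypothetical (made up here, not an existing Spark setting):
    ```
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Existing conf: already read on the JVM side and passed to ArrowPythonRunner.
    spark.conf.set("spark.sql.session.timeZone", "UTC")

    # Hypothetical conf for the proposed column-matching behavior:
    spark.conf.set("spark.sql.execution.pandas.groupedMap.assignColumnsByName", "true")
    ```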
    
    On the behavior side, I have thought more about this, and I feel the desirable behavior is to support both matching by name and matching by index (a sketch of the resolution logic follows after the list), i.e.:
    (1) If the output DataFrame has the same column names as the schema, we match by column name. This is the desirable behavior when the user does:
    ```
    return pd.DataFrame({'a': ..., 'b': ...})
    ```
    (2) If the output DataFrame has the default column names "0, 1, 2, ...", we match by index. These are the column names pandas assigns when the user doesn't name the columns while creating a pd.DataFrame, e.g.:
    ```
    >>> pd.DataFrame([[1, 2.0, "hello"], [4, 5.0, "xxx"]])
       0    1      2
    0  1  2.0  hello
    1  4  5.0    xxx
    ``` 
    (3) Otherwise, throw an exception.
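
    As a rough sketch of the resolution logic above (plain pandas, not the actual Spark worker code; `resolve_columns` and its argument names are made up for illustration):
    ```
    import pandas as pd

    def resolve_columns(pdf, schema_names):
        # (1) All schema names present: match by column name, ignoring order.
        if set(pdf.columns) == set(schema_names):
            return pdf[schema_names]
        # (2) Default "0, 1, 2, ..." columns (pandas' RangeIndex): match by index.
        if isinstance(pdf.columns, pd.RangeIndex) and len(pdf.columns) == len(schema_names):
            renamed = pdf.copy()
            renamed.columns = schema_names
            return renamed
        # (3) Anything else is ambiguous: throw.
        raise ValueError(
            "Output columns %s do not match schema %s" % (list(pdf.columns), schema_names))

    # Case (1): named columns are matched by name, so order doesn't matter.
    resolve_columns(pd.DataFrame({'b': [2.0], 'a': [1]}), ['a', 'b'])

    # Case (2): unnamed columns are matched positionally.
    resolve_columns(pd.DataFrame([[1, 2.0, "hello"], [4, 5.0, "xxx"]]), ['a', 'b', 'c'])
    ```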
    
    What do you think of having the new configuration support this behavior?


