spark-reviews mailing list archives

From BryanCutler <>
Subject [GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Date Mon, 23 Oct 2017 23:18:36 GMT
Github user BryanCutler commented on a diff in the pull request:
    --- Diff: python/pyspark/ ---
    @@ -224,7 +225,13 @@ def _create_batch(series):
         # If a nullable integer series has been promoted to floating point with NaNs, need to cast
         # NOTE: this is not necessary with Arrow >= 0.7
         def cast_series(s, t):
    -        if t is None or s.dtype == t.to_pandas_dtype():
    +        if type(t) == pa.TimestampType:
    +            # NOTE: convert to 'us' with astype here, unit ignored in `from_pandas` see
    +            return _series_convert_timestamps_internal(s).values.astype('datetime64[us]')
    +        elif t == pa.date32():
    +            # TODO: ValueError: Cannot cast DatetimeIndex to dtype datetime64[D]
    --- End diff ---
    I came across an issue when using `pandas_udf` that returns a `DateType`. Dates are input to the udf as a Series with dtype `datetime64[ns]`, and trying to use this for `pa.Array.from_pandas` with `type=pa.date32()` fails with an error. I am also unable to call `series.dt.values.astype('datetime64[D]')`, which results in an error. Without specifying the type, pyarrow will read the values as a timestamp. I filed an issue to look into this; aside from this being fixed in a newer version, any ideas for a workaround, @ueshin and @HyukjinKwon?
    cc @wesm 
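
    A minimal sketch of the conversions under discussion, using only pandas/NumPy (the pyarrow call sites are omitted; the `.dt.date` step for the date case is a hypothetical workaround, not the fix from the diff):

    ```python
    import numpy as np
    import pandas as pd

    # pandas hands the UDF timestamps/dates as a Series of dtype datetime64[ns].
    s = pd.Series(pd.to_datetime(["2017-10-23", "2017-10-24"]))

    # Timestamp case from the diff: truncate the raw NumPy values from
    # nanoseconds to microseconds via astype before they reach
    # pa.Array.from_pandas with a pa.TimestampType.
    us = s.values.astype("datetime64[us]")

    # Date case (assumed workaround): drop the time component by going through
    # Python date objects, which sidesteps the failing datetime64[D] cast
    # mentioned in the TODO.
    dates = s.dt.date  # object-dtype Series of datetime.date

    print(us.dtype)          # datetime64[us]
    print(list(dates))       # [datetime.date(2017, 10, 23), datetime.date(2017, 10, 24)]
    ```

    Whether pyarrow then accepts the object-dtype date Series as `pa.date32()` depends on the pyarrow version, which is exactly the compatibility question raised above.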


