spark-reviews mailing list archives

From ueshin <...@git.apache.org>
Subject [GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Date Wed, 18 Oct 2017 06:41:46 GMT
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18664#discussion_r145327316
  
    --- Diff: python/pyspark/serializers.py ---
    @@ -259,11 +261,13 @@ def load_stream(self, stream):
             """
             Deserialize ArrowRecordBatches to an Arrow table and return as a list of pandas.Series.
             """
    +        from pyspark.sql.types import _check_dataframe_localize_timestamps
             import pyarrow as pa
             reader = pa.open_stream(stream)
             for batch in reader:
    -            table = pa.Table.from_batches([batch])
    -            yield [c.to_pandas() for c in table.itercolumns()]
    +            # NOTE: changed from pa.Columns.to_pandas, timezone issue in conversion fixed in 0.7.1
    +            pdf = _check_dataframe_localize_timestamps(batch.to_pandas())
    +            yield [c for _, c in pdf.iteritems()]
    --- End diff --
    
    I ran your script on my local machine, too.
    
    - before change: 
      - mean: 2.605722
      - min: 2.502404
      - max: 3.045294
    - after change:
      - mean: 2.626306
      - min: 2.341781
      - max: 2.742432
    
    I think it's okay to use this workaround.
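    For context, a minimal sketch of the kind of localization such a helper performs, assuming it converts tz-aware timestamp columns back to tz-naive values in a session timezone. The helper name and timezone below are illustrative, not Spark's actual `_check_dataframe_localize_timestamps` implementation:
    
    ```python
    import pandas as pd
    
    def localize_timestamps(pdf, tz="America/Los_Angeles"):
        # Hypothetical helper: for each tz-aware datetime column,
        # convert to the given timezone and drop the tz info,
        # yielding naive local timestamps.
        for col in pdf.columns:
            if isinstance(pdf[col].dtype, pd.DatetimeTZDtype):
                pdf[col] = pdf[col].dt.tz_convert(tz).dt.tz_localize(None)
        return pdf
    
    pdf = pd.DataFrame({
        "ts": pd.to_datetime(["2017-10-18 06:41:46"]).tz_localize("UTC")
    })
    out = localize_timestamps(pdf)
    print(out["ts"].dtype)  # tz-naive datetime64[ns] after localization
    ```
    
    The point of iterating the DataFrame with `iteritems()` rather than `pa.Column.to_pandas()` is that the whole-batch `to_pandas()` path applies this localization consistently across columns.
    
    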


---


