arrow-user mailing list archives

From David Gallagher <dgallag...@CleverDevices.com>
Subject [Python] Different behavior between pandas.to_parquet and ParquetWriter.write_table?
Date Mon, 03 Aug 2020 15:10:37 GMT
Hi – I have a pandas dataframe that I want to output to parquet. The dataframe has a timestamp
field with timezone information. I need control over the schema of the output file, so I am using
ParquetWriter and a schema with the timestamp column defined as:


('timestamp', pa.timestamp('s', tz=self._timezone)),

where self._timezone is a string, e.g. ‘America/Los_Angeles’. I’m then writing out the file
using this code:


import os
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema(fields)
# replace_schema_metadata() with no arguments drops the pandas metadata from the table
table = pa.Table.from_pandas(self._df, schema, preserve_index=False).replace_schema_metadata()
writer = pq.ParquetWriter(os.path.join(file_path, '{}.parquet'.format(self._file_name)), schema=schema)
writer.write_table(table)
writer.close()

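For reference, a stripped-down version of what the class is doing looks roughly like this (the
column names, sample data, and output path here are just placeholders):


import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

tz = 'America/Los_Angeles'
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2020-08-03 08:00:00']).tz_localize(tz),
    'value': [1],
})

schema = pa.schema([
    ('timestamp', pa.timestamp('s', tz=tz)),
    ('value', pa.int64()),
])

# same pattern as above: convert with the explicit schema, drop the pandas metadata
table = pa.Table.from_pandas(df, schema, preserve_index=False).replace_schema_metadata()
writer = pq.ParquetWriter('example.parquet', schema=schema)
writer.write_table(table)
writer.close()
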
However, upon reading the resulting file, the timestamp is in UTC:

timestamp              datetime64[ns, UTC]
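
That dtype is what I see when I read the file straight back, e.g. like this (using the placeholder
path from the snippet above):


import pyarrow.parquet as pq

df_out = pq.read_table('example.parquet').to_pandas()
print(df_out.dtypes)   # timestamp    datetime64[ns, UTC]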

But if I write the same pandas dataframe to parquet directly with DataFrame.to_parquet, the
timestamp stays localized. Is this expected behavior? I’m using pyarrow 1.0.0. I also tried
playing with the ‘flavor’ argument of ParquetWriter, but that just seemed to produce naïve
UTC timestamps.
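
For comparison, the direct pandas path I mean is just something like this (again with a
placeholder path), and reading that file back gives me the localized dtype:


import pandas as pd

df.to_parquet('direct.parquet')
print(pd.read_parquet('direct.parquet').dtypes)   # timestamp    datetime64[ns, America/Los_Angeles]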

Thanks,

Dave

