arrow-user mailing list archives

From: Wes McKinney <wesmck...@gmail.com>
Subject: Re: [Python] Different behavior between pandas.to_parquet and ParquetWriter.write_table?
Date: Mon, 03 Aug 2020 16:37:43 GMT

I think the Parquet layer should probably restore a non-UTC timezone.
We store enough metadata that this should be possible:

In [20]: df = pd.DataFrame({'a': pd.Series(np.arange(0, 10000, 1000)).astype(
    ...:     pd.DatetimeTZDtype('ns', 'America/Los_Angeles'))})

In [21]: t = pa.table(df)

In [22]: t
Out[22]:
pyarrow.Table
a: timestamp[ns, tz=America/Los_Angeles]

In [23]: pq.write_table(t, 'test.parquet')

In [24]: pq.read_table('test.parquet')
Out[24]:
pyarrow.Table
a: timestamp[us, tz=UTC]

In [25]: pq.read_table('test.parquet')[0]
Out[25]:
<pyarrow.lib.ChunkedArray object at 0x7f72eb4b68f0>
[
  [
    1970-01-01 00:00:00.000000,
    1970-01-01 00:00:00.000001,
    1970-01-01 00:00:00.000002,
    1970-01-01 00:00:00.000003,
    1970-01-01 00:00:00.000004,
    1970-01-01 00:00:00.000005,
    1970-01-01 00:00:00.000006,
    1970-01-01 00:00:00.000007,
    1970-01-01 00:00:00.000008,
    1970-01-01 00:00:00.000009
  ]
]

I opened https://issues.apache.org/jira/browse/ARROW-9634 so someone
can look into it.
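
In the meantime, one possible workaround is to convert the column back to
the original zone after reading. The sketch below is only an illustration
under assumptions: it uses pandas alone, the names utc_col and local_col are
hypothetical, and utc_col stands in for a column that came back from the
file as tz-aware UTC, as in the report. It relies on the stored values being
UTC instants, so converting the zone changes how they are displayed, not
which instants they are.

```python
import pandas as pd

# Stand-in for a column as read back from the file: tz-aware, but in UTC.
# (Hypothetical data; in practice this would come from the Parquet file.)
utc_col = pd.Series(pd.to_datetime([0, 1000, 2000], unit="ns", utc=True))

# tz_convert changes the zone attached to the column; the underlying
# instants are unchanged.
local_col = utc_col.dt.tz_convert("America/Los_Angeles")

print(local_col.dtype)
```

The comparison local_col == utc_col holds element-wise, since both columns
are tz-aware and describe the same instants.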

On Mon, Aug 3, 2020 at 10:10 AM David Gallagher
<dgallagher@cleverdevices.com> wrote:
>
> Hi – I have a pandas dataframe that I want to output to parquet. The
> dataframe has a timestamp field with timezone information. I need control
> over the schema at output, so I am using ParquetWriter and a schema with
> the timestamp column defined as:
>
>
>
> ('timestamp', pa.timestamp('s', tz=self._timezone)),
>
>
>
> Where timezone is a string, e.g. 'America/Los_Angeles'. I’m then writing
> out the file using this code:
>
>
>
> schema = pa.schema(fields)
> table = pa.Table.from_pandas(self._df, schema, preserve_index=False).replace_schema_metadata()
> writer = pq.ParquetWriter(os.path.join(file_path, '{}.parquet'.format(self._file_name)), schema=schema)
> writer.write_table(table)
> writer.close()
>
>
>
> However, upon reading the resulting file, the timestamp is in UTC:
>
>
>
> timestamp              datetime64[ns, UTC]
>
>
>
> But, if I output the same pandas dataframe to parquet directly, the
> timestamp is localized. Is this expected behavior? I’m using pyarrow 1.0.0.
> I tried playing with the 'flavor' argument of ParquetWriter, but this just
> seemed to generate naïve UTC timestamps.
>
>
>
> Thanks,
>
>
>
> Dave
>
>
