arrow-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Troy Zimmerman <tzimmer...@jumptrading.com>
Subject Dataset filter expression in Arrow 3 vs 4
Date Fri, 04 Jun 2021 20:27:23 GMT
Hello,

When upgrading from Arrow 3 to 4, I noticed a change in behavior with dataset expressions.

Here's a barebones example.

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> tbl = pa.Table.from_pydict(
...     {"id": [1, 2, 3, 4, 5],
...      "valid_from": [date(2021, 1, 1), date(2021, 1, 1), date(2021, 1, 1), date(2021, 1,
1), date(2021, 1, 1)],
...      "valid_to": [date(2021, 1, 31), date(2021, 1, 31), date(2021, 2, 15), date(2021,
2, 15), date(2021, 3, 1)]
...     }
... )
>>> tbl = tbl.cast(pa.schema([("id", pa.int64()), ("valid_from", pa.timestamp("ns")),
("valid_to", pa.timestamp("ns"))]))
>>> pq.write_table(tbl, "/home/tzimmerman/test.parquet")
>>> data = ds.dataset("/home/tzimmerman/test.parquet")
>>> expr = (ds.field("valid_from") <= "2021-01-01") & ("2021-01-01" < ds.field("valid_to"))

With Python 3.8 & Arrow 3.0.0, it appears the date literal ("2021-01-1") is cast to a
compatible dtype and the expression works as expected.

>>> data.to_table(filter=expr)
pyarrow.Table
id: int64
valid_from: timestamp[us]
valid_to: timestamp[us]

With Python 3.8 & Arrow 4.0.0, there is a kernel mismatch error.

>>> data.to_table(filter=expr)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 458, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 363, in pyarrow._dataset.Dataset._scanner
  File "pyarrow/_dataset.pyx", line 2786, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 2683, in pyarrow._dataset._populate_builder
  File "pyarrow/_dataset.pyx", line 737, in pyarrow._dataset._bind
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 118, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function less_equal has no kernel matching input types
(array[timestamp[us]], scalar[string])

I just want to confirm that this Is expected, and the implicit casting/coercing is now unsupported?

Best,
Troy

Mime
View raw message