arrow-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Troy Zimmerman <>
Subject RE: Dataset filter expression in Arrow 3 vs 4
Date Fri, 04 Jun 2021 22:36:46 GMT
Hi Neil,

Thanks for the explanation. I use the ast module to parse expressions and map them to the
dataset API, and it never dawned on me until the upgrade what I was getting for free. 😊
It’s easy enough to handle a cast function of our own in the ast visitor.


From: Neal Richardson <>
Sent: Friday, June 4, 2021 4:14 PM
Subject: Re: Dataset filter expression in Arrow 3 vs 4

IRONSCALES couldn't recognize this email as this is the first time you received an email from
this sender<>

[This message has originated from an EXTERNAL SENDER]
Implicit casting is still supported, it's just different in 4.0. AFAIK the clearest statement
we have of the change is on the R changelog [1]:

"Arrow C++ compute functions now do more systematic type promotion when called on data with
different types (e.g. int32 and float64). Previously, Scalars in an expressions were always
cast to match the type of the corresponding Array, so this new type promotion enables, among
other things, operations on two columns (Arrays) in a dataset. As a side effect, some comparisons
that worked in prior versions are no longer supported: for example, dplyr::filter(arrow_dataset,
string_column == 3) will error with a message about the type mismatch between the numeric
3 and the string type of string_column."

So in your case, you might use `date(2021, 1, 1)` instead of the string "2021-01-01".



On Fri, Jun 4, 2021 at 1:27 PM Troy Zimmerman <<>>

When upgrading from Arrow 3 to 4, I noticed a change in behavior with dataset expressions.

Here’s a barebones example.

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> tbl = pa.Table.from_pydict(
…     {"id": [1, 2, 3, 4, 5],
…      "valid_from": [date(2021, 1, 1), date(2021, 1, 1), date(2021, 1, 1), date(2021, 1,
1), date(2021, 1, 1)],
…      "valid_to": [date(2021, 1, 31), date(2021, 1, 31), date(2021, 2, 15), date(2021,
2, 15), date(2021, 3, 1)]
…     }
… )
>>> tbl = tbl.cast(pa.schema([("id", pa.int64()), ("valid_from", pa.timestamp("ns")),
("valid_to", pa.timestamp("ns"))]))
>>> pq.write_table(tbl, "/home/tzimmerman/test.parquet")
>>> data = ds.dataset("/home/tzimmerman/test.parquet")
>>> expr = (ds.field("valid_from") <= "2021-01-01") & ("2021-01-01" < ds.field("valid_to"))

With Python 3.8 & Arrow 3.0.0, it appears the date literal (“2021-01-1”) is cast to
a compatible dtype and the expression works as expected.

>>> data.to_table(filter=expr)
id: int64
valid_from: timestamp[us]
valid_to: timestamp[us]

With Python 3.8 & Arrow 4.0.0, there is a kernel mismatch error.

>>> data.to_table(filter=expr)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 458, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 363, in pyarrow._dataset.Dataset._scanner
  File "pyarrow/_dataset.pyx", line 2786, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 2683, in pyarrow._dataset._populate_builder
  File "pyarrow/_dataset.pyx", line 737, in pyarrow._dataset._bind
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 118, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function less_equal has no kernel matching input types
(array[timestamp[us]], scalar[string])

I just want to confirm that this Is expected, and the implicit casting/coercing is now unsupported?

View raw message