arrow-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Neal Richardson <neal.p.richard...@gmail.com>
Subject Re: Dataset filter expression in Arrow 3 vs 4
Date Fri, 04 Jun 2021 21:14:28 GMT
Implicit casting is still supported, it's just different in 4.0. AFAIK the
clearest statement we have of the change is on the R changelog [1]:

"Arrow C++ compute functions now do more systematic type promotion when
called on data with different types (e.g. int32 and float64). Previously,
Scalars in an expressions were always cast to match the type of the
corresponding Array, so this new type promotion enables, among other
things, operations on two columns (Arrays) in a dataset. As a side effect,
some comparisons that worked in prior versions are no longer supported: for
example, dplyr::filter(arrow_dataset, string_column == 3) will error with a
message about the type mismatch between the numeric 3 and the string type
of string_column."

So in your case, you might use `date(2021, 1, 1)` instead of the string
"2021-01-01".

Neal

[1]: https://arrow.apache.org/docs/r/news/index.html

On Fri, Jun 4, 2021 at 1:27 PM Troy Zimmerman <tzimmerman@jumptrading.com>
wrote:

> Hello,
>
>
>
> When upgrading from Arrow 3 to 4, I noticed a change in behavior with
> dataset expressions.
>
>
>
> Here’s a barebones example.
>
>
>
> >>> import pyarrow as pa
>
> >>> import pyarrow.dataset as ds
>
> >>> import pyarrow.parquet as pq
>
> >>> tbl = pa.Table.from_pydict(
>
> …     {"id": [1, 2, 3, 4, 5],
>
> …      "valid_from": [date(2021, 1, 1), date(2021, 1, 1), date(2021, 1,
> 1), date(2021, 1, 1), date(2021, 1, 1)],
>
> …      "valid_to": [date(2021, 1, 31), date(2021, 1, 31), date(2021, 2,
> 15), date(2021, 2, 15), date(2021, 3, 1)]
>
> …     }
>
> … )
>
> >>> tbl = tbl.cast(pa.schema([("id", pa.int64()), ("valid_from",
> pa.timestamp("ns")), ("valid_to", pa.timestamp("ns"))]))
>
> >>> pq.write_table(tbl, "/home/tzimmerman/test.parquet")
>
> >>> data = ds.dataset("/home/tzimmerman/test.parquet")
>
> >>> expr = (ds.field("valid_from") <= "2021-01-01") & ("2021-01-01" <
> ds.field("valid_to"))
>
>
>
> With Python 3.8 & Arrow 3.0.0, it appears the date literal (“2021-01-1”)
> is cast to a compatible dtype and the expression works as expected.
>
>
>
> >>> data.to_table(filter=expr)
>
> pyarrow.Table
>
> id: int64
>
> valid_from: timestamp[us]
>
> valid_to: timestamp[us]
>
>
>
> With Python 3.8 & Arrow 4.0.0, there is a kernel mismatch error.
>
>
>
> >>> data.to_table(filter=expr)
>
> Traceback (most recent call last):
>
>  File "<stdin>", line 1, in <module>
>
>   File "pyarrow/_dataset.pyx", line 458, in
> pyarrow._dataset.Dataset.to_table
>
>   File "pyarrow/_dataset.pyx", line 363, in
> pyarrow._dataset.Dataset._scanner
>
>   File "pyarrow/_dataset.pyx", line 2786, in
> pyarrow._dataset.Scanner.from_dataset
>
>   File "pyarrow/_dataset.pyx", line 2683, in
> pyarrow._dataset._populate_builder
>
>   File "pyarrow/_dataset.pyx", line 737, in pyarrow._dataset._bind
>
>   File "pyarrow/error.pxi", line 141, in
> pyarrow.lib.pyarrow_internal_check_status
>
>   File "pyarrow/error.pxi", line 118, in pyarrow.lib.check_status
>
> pyarrow.lib.ArrowNotImplementedError: Function less_equal has no kernel
> matching input types (array[timestamp[us]], scalar[string])
>
>
>
> I just want to confirm that this Is expected, and the implicit
> casting/coercing is now unsupported?
>
>
>
> Best,
>
> Troy
>

Mime
View raw message