arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Possible Decimal write issue with pyarrow
Date Tue, 02 Jun 2020 12:28:40 GMT
hi Joris -- thank you for investigating. There's some code in the
Parquet write path that converts the 128-bit decimals to the Parquet
representation, which is usually smaller than 16 bytes per value, so I
would guess the bug lies in this Arrow 128-bit decimal to Parquet
FIXED_LEN_BYTE_ARRAY conversion.
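
A minimal sketch of what that failure mode could look like -- my
reconstruction, not the actual Arrow C++ code path. Parquet stores a
decimal(19, 3) in a 9-byte FIXED_LEN_BYTE_ARRAY, but the unscaled value
below needs 10 bytes, so dropping the high bytes flips the sign:

```python
import decimal

# Hypothetical reconstruction of the suspected truncation (a sketch, not
# the actual write path). The unscaled integer for 9223372036854775808
# at scale 3 needs 10 bytes; a 9-byte write keeps only the low bytes.
unscaled = 9223372036854775808 * 10**3            # value rescaled to scale 3

full = unscaled.to_bytes(16, "big", signed=True)  # 128-bit form Arrow holds
truncated = full[-9:]                             # too-narrow write: low 9 bytes only

corrupted = int.from_bytes(truncated, "big", signed=True)
print(decimal.Decimal(corrupted).scaleb(-3))      # -221360928884514619.392
```

This reproduces exactly the corrupted value Joris observed below, which
is consistent with high-byte truncation of the unscaled integer.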

On Tue, Jun 2, 2020 at 2:52 AM Joris Van den Bossche
<jorisvandenbossche@gmail.com> wrote:
>
> Hi Rich,
>
> Thanks for the report.
> It seems that the issue is in the Parquet writing or reading itself, and not the pandas<->pyarrow conversion.
>
> Converting from python to pyarrow looks OK:
>
> In [15]: arr = pa.array([decimal.Decimal('9223372036854775808'), decimal.Decimal('1.111')])
>
> In [16]: arr
> Out[16]:
> <pyarrow.lib.Decimal128Array object at 0x7fd07d79a468>
> [
>   9223372036854775808.000,
>   1.111
> ]
>
> But then writing and reading again to/from parquet gives the issue:
>
> In [17]: pq.write_table(pa.table({'a': arr}), "test_decimal.parquet")
>
> In [18]: pq.read_table("test_decimal.parquet")
> Out[18]:
> pyarrow.Table
> a: decimal(19, 3)
>
> In [19]: pq.read_table("test_decimal.parquet").column('a')
> Out[19]:
> <pyarrow.lib.ChunkedArray object at 0x7fd0711e9f98>
> [
>   [
>     -221360928884514619.392,
>     1.111
>   ]
> ]
>
> This happens here with a "decimal(19, 3)" type; when using 1.11 instead of 1.111, the decimal type is "decimal(19, 2)".
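
For reference, that inferred type already overflows on this input -- a
sketch, assuming precision 19 (from the 19-digit integer) and scale 3
(from 1.111) are inferred independently:

```python
import decimal

# decimal(19, 3) stores unscaled integers of at most 19 digits, but
# rescaling the large value to scale 3 needs 22 digits (assumption:
# precision and scale are inferred separately from the two inputs).
unscaled = int(decimal.Decimal('9223372036854775808').scaleb(3))
print(unscaled)            # 9223372036854775808000
print(len(str(unscaled)))  # 22
```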
>
> I am not too familiar with the decimal type, but I opened a JIRA issue for this: https://issues.apache.org/jira/browse/PARQUET-1869
>
> Joris
>
>
> On Mon, 1 Jun 2020 at 23:39, Rich Bramante <rbramante@hotmail.com> wrote:
>>
>> Python 3.7.6 (default, Jan 30 2020, 10:29:04)
>> [GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
>> print(pyarrow.__version__)
>> 0.17.1
>>
>> Seeing an issue where DECIMAL values written appear to be corrupted depending on very subtle changes to the data set. Example:
>>
>> #!/bin/python3
>>
>> import pandas as pd
>> import decimal
>> import pyarrow.parquet as pq
>>
>> #$ python3
>> # Python 3.7.6 (default, Jan 30 2020, 10:29:04)
>> # [GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
>> # >>> print(pyarrow.__version__)
>> #  0.17.1
>>
>> # Results in unexpected output
>> df = pd.DataFrame({"values": [decimal.Decimal('9223372036854775808'), decimal.Decimal('18446744073709551616'),
>>                               decimal.Decimal('2147483648'), decimal.Decimal('1.111'), decimal.Decimal('-2'), decimal.Decimal('0')]})
>>
>> df.to_parquet("/tmp/f")
>> pq_file = pq.ParquetFile("/tmp/f")
>> print(pq_file.read().to_pandas())
>>
>> # Values read:
>> # -221360928884514619.392, -442721857769029238.784, 2147483648.000, 1.111, -2.000, 0.000
>>
>> # Results in expected output (only difference is 1.11 vs. 1.111)
>> df = pd.DataFrame({"values": [decimal.Decimal('9223372036854775808'), decimal.Decimal('18446744073709551616'),
>>                               decimal.Decimal('2147483648'), decimal.Decimal('1.11'), decimal.Decimal('-2'), decimal.Decimal('0')]})
>>
>> df.to_parquet("/tmp/f")
>> pq_file = pq.ParquetFile("/tmp/f")
>> print(pq_file.read().to_pandas())
>>
>> # Values read:
>> # 9223372036854775808.00, 18446744073709551616.00, 2147483648.00, 1.11, -2.00, 0.00
>>
