arrow-dev mailing list archives

From Antoine Pitrou <anto...@python.org>
Subject Re: [DISCUSS] IPC buffer layout for Null type
Date Fri, 06 Sep 2019 17:08:32 GMT

Null can also come up when converting a column with only NA values in a
CSV file.  I don't remember for sure, but I think the same can happen
with JSON files as well.

Can't we accept both forms when reading?  It sounds like it should be
reasonably easy.
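
A minimal sketch of how a reader might accept both forms, assuming it can compare the number of buffers listed in the RecordBatch header against the number the schema implies. The helper name, the type table, and the string type names are hypothetical and only for illustration; this is not the Arrow C++ reader.

```python
from typing import List

# Buffer counts per column for a few flat Arrow types (illustrative subset).
BUFFERS_PER_TYPE = {"int64": 2, "float64": 2, "utf8": 3}
LEGACY_NULL_BUFFERS = 2    # two length-0 "placeholder" buffers
PROPOSED_NULL_BUFFERS = 0  # no buffers written for a Null column


def null_buffers_per_column(field_types: List[str], num_buffers: int) -> int:
    """Return how many buffers each Null column occupies in this batch.

    Hypothetical helper: it infers the convention from the total buffer
    count in the batch header, so no extra flag is needed in the stream.
    """
    n_null = field_types.count("null")
    base = sum(BUFFERS_PER_TYPE[t] for t in field_types if t != "null")
    for per_null in (PROPOSED_NULL_BUFFERS, LEGACY_NULL_BUFFERS):
        if num_buffers == base + per_null * n_null:
            return per_null
    raise ValueError(
        f"unexpected buffer count {num_buffers} for fields {field_types}"
    )


# One int64 column plus one Null column:
print(null_buffers_per_column(["int64", "null"], 2))  # 0 (no buffers)
print(null_buffers_per_column(["int64", "null"], 4))  # 2 (placeholders)
```

Under this assumption the reader skips either zero or two buffer entries for each Null column and rebuilds the column from its length alone.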

Regards

Antoine.


On 06/09/2019 at 17:36, Wes McKinney wrote:
> hi Micah,
> 
> Null wouldn't come up that often in practice. It could happen when
> converting from pandas, for example:
> 
> In [8]: df = pd.DataFrame({'col1': np.array([np.nan] * 10, dtype=object)})
> 
> In [9]: t = pa.table(df)
> 
> In [10]: t
> Out[10]:
> pyarrow.Table
> col1: null
> metadata
> --------
> {b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
>             b'stop": 10, "step": 1}], "column_indexes": [{"name": null, "field'
>             b'_name": null, "pandas_type": "unicode", "numpy_type": "object", '
>             b'"metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "col1"'
>             b', "field_name": "col1", "pandas_type": "empty", "numpy_type": "o'
>             b'bject", "metadata": null}], "creator": {"library": "pyarrow", "v'
>             b'ersion": "0.14.1.dev464+g40d08a751"}, "pandas_version": "0.24.2"'
>             b'}'}
> 
> I'm inclined to make the change without worrying about backwards
> compatibility. If people have been persisting data against the
> recommendations of the project, the remedy is to use an older version
> of the library to read the files and write them to something else
> (like Parquet format) in the meantime.
> 
> Obviously come 1.0.0 we'll begin to make compatibility guarantees so
> this will be less of an issue.
> 
> - Wes
> 
> On Thu, Sep 5, 2019 at 11:14 PM Micah Kornfield <emkornfield@gmail.com> wrote:
>>
>> Hi Wes and others,
>> I don't have a sense of where Null arrays get created in the existing code
>> base.
>>
>> Also, do you think it is worth the effort to make this backwards compatible?
>> We could in theory tie the buffer count to having the continuation value
>> for alignment.
>>
>> The one area where I'm slightly concerned is that we seem to have users in the
>> wild who are depending on backwards compatibility, and I'm trying to better
>> understand the odds that we break them.
>>
>> Thanks,
>> Micah
>>
>> On Thu, Sep 5, 2019 at 7:25 AM Wes McKinney <wesmckinn@gmail.com> wrote:
>>
>>> hi folks,
>>>
>>> One of the as-yet-untested (in integration tests) parts of the
>>> columnar specification is the Null layout. In C++ we additionally
>>> implemented this by writing two length-0 "placeholder" buffers in the
>>> RecordBatch data header, but since the Null layout has no memory
>>> allocated nor any buffers in-memory it may be more proper to write no
>>> buffers (since the length of the Null layout is all you need to
>>> reconstruct it). There are 3 implementations of the placeholder
>>> version (C++, Go, JS, maybe also C#) but it never got implemented in
>>> Java. While technically this would break old serialized data, I would
>>> not expect Null columns to occur very frequently in
>>> currently deployed Arrow applications.
>>>
>>> Here is my C++ patch
>>>
>>> https://github.com/apache/arrow/pull/5287
>>>
>>> I'm not sure we need to formalize this with a vote but I'm interested
>>> in the community's feedback on how to proceed here.
>>>
>>> - Wes
>>>
