arrow-user mailing list archives

From Fan Liya <liya.fa...@gmail.com>
Subject Re: [Java JDBC adapter] non-nullable fields?
Date Fri, 07 May 2021 12:23:13 GMT
Thanks for your effort.
I'd like to help with the code review.

Best,
Liya Fan


On Fri, May 7, 2021 at 5:20 PM Joris Peeters <joris.mg.peeters@gmail.com>
wrote:

> https://issues.apache.org/jira/browse/ARROW-12679
>
> On Fri, May 7, 2021 at 8:54 AM Joris Peeters <joris.mg.peeters@gmail.com>
> wrote:
>
>> Fair enough.
>> I have this data moving through a few different servers and clients, in
>> IPC streaming format, consumed on various platforms/languages. The
>> nullability in the schema is often used in "language-friendly" clients,
>> e.g. to build a `std::vector<bool>` or `std::vector<std::optional<bool>>`
>> depending on whether the bit column is nullable, so preserving this
>> information is quite important, even if locally in Java it makes little
>> difference.
>>
>> I've worked around it for now by fudging the VectorSchemaRoot's schema
>> myself, but I'll open a JIRA to track, and I'll assign it to myself and
>> provide a fix.
>>
>> Cheers!
>> -Joris.
>>
>>
>> On Fri, May 7, 2021 at 3:22 AM Fan Liya <liya.fan03@gmail.com> wrote:
>>
>>> Hi Joris,
>>>
>>> I think you are right.
>>>
>>> We only use the nullability information in the consumers, because it
>>> makes a difference in performance.
>>>
>>> The nullability information in the schema is not accurate, as you have
>>> observed.
>>> However, such information is not well-used in the Java implementation
>>> (IMHO). For example, the validity buffer is allocated even if the vector is
>>> non-nullable.
>>>
>>> That said, I think it would be better to keep the nullability
>>> information in sync.
>>> So maybe we can open a JIRA to track it?
>>>
>>> Best,
>>> Liya Fan
>>>
>>>
>>> On Thu, May 6, 2021 at 3:09 PM Joris Peeters <joris.mg.peeters@gmail.com>
>>> wrote:
>>>
>>>> Hello Fan,
>>>>
>>>> Yes, but it seems that code path only affects the consumers, and
>>>> whether they set a value in the vector or not, see e.g.
>>>> https://github.com/apache/arrow/blob/master/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/consumer/DoubleConsumer.java#L57
>>>> However, the VectorSchemaRoot's schema, defined I believe at
>>>> https://github.com/apache/arrow/blob/master/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/ArrowVectorIterator.java#L59,
>>>> does not appear to use this info, and just sets every column's nullability
>>>> to true (as per the link in my original email).
>>>>
>>>> Note that we are indeed using the ArrowVectorIterator, and it's when
>>>> iterating over the iterator and inspecting the schema of the elements
>>>> (VectorSchemaRoot) that I notice this.
>>>> Maybe all this needs is a `isColumnNullable(i, ..)` instead of `true`
>>>> in `final FieldType fieldType = new FieldType(true, arrowType, /*
>>>> dictionary encoding */ null, metadata);`.
>>>>
>>>> Cheers,
>>>> -J
>>>>
>>>> On Thu, May 6, 2021 at 5:53 AM Fan Liya <liya.fan03@gmail.com> wrote:
>>>>
>>>>> Hi Joris,
>>>>>
>>>>> Thanks for reporting the problem.
>>>>>
>>>>> We make use of the nullable information
>>>>> in ArrowVectorIterator#initialize. (Details can be found in
>>>>> https://github.com/apache/arrow/blob/master/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/ArrowVectorIterator.java#L73
>>>>> )
>>>>>
>>>>> Please note that the ArrowVectorIterator is our encouraged way of
>>>>> using the JDBC adapter.
>>>>>
>>>>> Best,
>>>>> Liya Fan
>>>>>
>>>>>
>>>>> On Wed, May 5, 2021 at 1:42 PM Micah Kornfield <emkornfield@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I would need to look further, but I thought we handled null vs not
>>>>>> null.  At least I thought we had specialized conversion code to avoid
>>>>>> branches.  If this isn't the case it seems reasonable to contribute
>>>>>> a path.
>>>>>>
>>>>>> On Tue, May 4, 2021 at 3:39 AM Joris Peeters <
>>>>>> joris.mg.peeters@gmail.com> wrote:
>>>>>>
>>>>>>> I'm looking to use the Java JDBC adapter for loading tables from
>>>>>>> SQL Server into Arrow record batches.
>>>>>>>
>>>>>>> At first glance the Arrow JDBC adapter seems to work well but,
>>>>>>> unless I'm mistaken, it simply makes every vector nullable, irrespective
>>>>>>> of whether the corresponding SQL column is nullable or not.
>>>>>>>
>>>>>>> I think the line
>>>>>>>
>>>>>>> final FieldType fieldType = new FieldType(true, arrowType, /*
>>>>>>> dictionary encoding */ null, metadata);
>>>>>>>
>>>>>>> in
>>>>>>> https://github.com/apache/arrow/blob/master/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowUtils.java#L158
>>>>>>> might be the cause here.
>>>>>>>
>>>>>>> Is my interpretation correct, or am I missing a setting of sorts? If
>>>>>>> indeed correct, is there a fundamental reason the NULL-ness is not
>>>>>>> transferred, or is this something I could contribute in a PR? (which
>>>>>>> I'd be happy to) I guess it's just a matter of inspecting the result
>>>>>>> metadata.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> -J
>>>>>>>
>>>>>>
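
The change discussed in the thread, replacing the hard-coded `true` in the `FieldType` constructor with a per-column check, could be sketched roughly as below. This is only an illustration using the standard `java.sql.ResultSetMetaData` API; the class `NullabilitySketch` and the method `isNullableCode` are hypothetical names, and treating `columnNullableUnknown` as nullable is an assumption, not something the thread settles. It is not the code that was eventually attached to ARROW-12679.

```java
import java.sql.ResultSetMetaData;
import java.sql.SQLException;

/** Hypothetical helper for illustration; not part of the Arrow JDBC adapter. */
final class NullabilitySketch {

  private NullabilitySketch() {}

  /**
   * Interprets a JDBC nullability code as returned by ResultSetMetaData#isNullable.
   * Assumption: columnNullableUnknown is treated as nullable, the conservative choice,
   * so only columnNoNulls yields a non-nullable field.
   */
  static boolean isNullableCode(int jdbcNullability) {
    return jdbcNullability != ResultSetMetaData.columnNoNulls;
  }

  /** Looks up the nullability of a column by its 1-based JDBC index. */
  static boolean isColumnNullable(ResultSetMetaData metadata, int column)
      throws SQLException {
    return isNullableCode(metadata.isNullable(column));
  }
}
```

With a helper like this, the line quoted in the original message would pass `isColumnNullable(metadata, i)` as the first `FieldType` argument in place of `true`, where `metadata` stands for the result set's `ResultSetMetaData` and `i` for the column index.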
