From user-return-1226-archive-asf-public=cust-asf.ponee.io@arrow.apache.org Fri May 7 07:55:14 2021 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mxout1-he-de.apache.org (mxout1-he-de.apache.org [95.216.194.37]) by mx-eu-01.ponee.io (Postfix) with ESMTPS id B110018063F for ; Fri, 7 May 2021 09:55:14 +0200 (CEST) Received: from mail.apache.org (mailroute1-lw-us.apache.org [207.244.88.153]) by mxout1-he-de.apache.org (ASF Mail Server at mxout1-he-de.apache.org) with SMTP id DE0FE618FE for ; Fri, 7 May 2021 07:55:13 +0000 (UTC) Received: (qmail 32646 invoked by uid 500); 7 May 2021 07:55:12 -0000 Mailing-List: contact user-help@arrow.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@arrow.apache.org Delivered-To: mailing list user@arrow.apache.org Received: (qmail 32630 invoked by uid 99); 7 May 2021 07:55:11 -0000 Received: from spamproc1-he-de.apache.org (HELO spamproc1-he-de.apache.org) (116.203.196.100) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 07 May 2021 07:55:11 +0000 Received: from localhost (localhost [127.0.0.1]) by spamproc1-he-de.apache.org (ASF Mail Server at spamproc1-he-de.apache.org) with ESMTP id 0DDC51FF3A4 for ; Fri, 7 May 2021 07:55:11 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamproc1-he-de.apache.org X-Spam-Flag: NO X-Spam-Score: -0.001 X-Spam-Level: X-Spam-Status: No, score=-0.001 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=0.2, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamproc1-he-de.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-he-de.apache.org ([116.203.227.195]) by localhost (spamproc1-he-de.apache.org [116.203.196.100]) (amavisd-new, port 10024) with ESMTP id 3uWXb8mNLm9P for ; Fri, 7 May 2021 07:55:10 +0000 (UTC) Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=2a00:1450:4864:20::230; helo=mail-lj1-x230.google.com; envelope-from=joris.mg.peeters@gmail.com; receiver= Received: from mail-lj1-x230.google.com (mail-lj1-x230.google.com [IPv6:2a00:1450:4864:20::230]) by mx1-he-de.apache.org (ASF Mail Server at mx1-he-de.apache.org) with ESMTPS id 601B97FDCB for ; Fri, 7 May 2021 07:55:10 +0000 (UTC) Received: by mail-lj1-x230.google.com with SMTP id b7so10430982ljr.4 for ; Fri, 07 May 2021 00:55:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to; bh=gzPidUWqIkoNxZZUvaGbjrRXzb1s6kj6Gq1aQic0s/k=; b=MfY1hSb/qZIaHyr1UOK92Leu94vdBP7ovTnP+lDzjv7DpeLJnkv3PJVO69F9w39jms Suqigw2LYHqeVkwrvr4hU4WJn697+ilWcUpYvbrmyAFrDLdRHDvL9WfzjBjYbxtuwodJ tDomP26SxySge0cTbWpmlhYLG7I1BTyr2GnyWhYPMXKUROJ2Yuj8L2TFpNw9EmtGnr4I KkaDe9QBBh1sKMpADxE+ltPEXnPmro+buyBVxd8eAx55vwxcnPZPblwNLw8I5vz+9Wzb zNHhHQmV3Y5YDlPNscSQKqP8vFYS9L8oPB1fj/slcwkaCn8V8fXaXNI2/1ZoGfA3ZGTK +nqg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to; bh=gzPidUWqIkoNxZZUvaGbjrRXzb1s6kj6Gq1aQic0s/k=; b=SCi53zq85Q7FCXfWGxt6ml/P/Y/ETAal5W3KcK2B+1DzgeEX4p0ug/lS1P+oJa7GEO cMwcKq3TqLBTyS9Dd3quByDScCSQy/uLnFV0FLwY54sqNyX1aqtF6wq/akH6KOb+vbrq DAjMSYl5aVxKrAbcTXy+ecLURZgYbmnBHHcY2DEAIji0rzXplcJfjhd47RHAePPnd8c8 Ii1liS0anTRX9u7iukGnfVR3tCBWI7OcVgeYcIQGAywHG0xABCBA/URSIrwUfP+bsSN0 BUjo7kTEVYE/ezFGeCubBn/idGEYDN2IwxDEIXglSjvumlKERoj2SPOQrgQatw5Wk0uh 6EGg== X-Gm-Message-State: AOAM531VfGL2NvkuqRUtPrI4yvsnr0SnNguvHcrXhuU++biAksk7ZMKL WpO3se+kk8ZyVD28NFdU2IyDF75W/8rxngdPSBPmmM8O6nk= X-Google-Smtp-Source: ABdhPJwX0abi1R6lVGrMTbKQefK9sD6k5g6haNOOaZTVO5/TK5wTpJPHJUlWG4CwTGzgCcccySuA4lChZqZlsyNHjXQ= X-Received: by 2002:a2e:9701:: with SMTP id r1mr6771961lji.213.1620374103712; Fri, 07 May 2021 00:55:03 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Joris Peeters Date: Fri, 7 May 2021 08:54:53 +0100 Message-ID: Subject: Re: [Java JDBC adapter] non-nullable fields? To: user@arrow.apache.org Content-Type: multipart/alternative; boundary="000000000000dae53505c1b8be79" --000000000000dae53505c1b8be79 Content-Type: text/plain; charset="UTF-8" Fair enough. I have this data moving through a few different servers and clients, in IPC streaming format, consumed on various platforms/languages. The nullability in the schema is often used in "language-friendly" clients, e.g. to build a `std::vector` or `std::vector>` depending on whether the bit column is nullable, so preserving this information is quite important, even if locally in Java it makes little difference. I've worked around it for now by fudging the VectorSchemaRoot's schema myself, but I'll open a JIRA to track, and I'll assign it to myself and provide a fix. Cheers! -Joris. On Fri, May 7, 2021 at 3:22 AM Fan Liya wrote: > Hi Joris, > > I think you are right. > > We only use the nullability information in the consumers, because it makes > a difference in performance. > > The nullability information in the schema is not accurate, as you have > observed. > However, such information is not well-used in the Java implementation > (IMHO). For example, the validity buffer is allocated even if the vector is > non-nullable. > > That said, I think it would be better to keep the nullability information > in sync. > So maybe we can open a JIRA to track it? > > Best, > Liya Fan > > > On Thu, May 6, 2021 at 3:09 PM Joris Peeters > wrote: > >> Hello Fan, >> >> Yes, but it seems that code path only affects the consumers, and whether >> they set a value in the vector or not, see e.g. >> https://github.com/apache/arrow/blob/master/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/consumer/DoubleConsumer.java#L57 >> However, the VectorSchemaRoot's schema, defined I believe at >> https://github.com/apache/arrow/blob/master/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/ArrowVectorIterator.java#L59, >> does not appear to use this info, and just sets every column's nullability >> to true (as per the link in my original email). >> >> Note that we are indeed using the ArrowVectorIterator, and it's when >> iterating over the iterator and inspecting the schema of the elements >> (VectorSchemaRoot) that I notice this. >> Maybe all this needs is a `isColumnNullable(i, ..)` instead of `true` in >> `final FieldType fieldType = new FieldType(true, arrowType, /* dictionary >> encoding */ null, metadata);`. >> >> Cheers, >> -J >> >> On Thu, May 6, 2021 at 5:53 AM Fan Liya wrote: >> >>> Hi Joris, >>> >>> Thanks for reporting the problem. >>> >>> We make use of the nullable information >>> in ArrowVectorIterator#initialize. (Details can be found in >>> https://github.com/apache/arrow/blob/master/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/ArrowVectorIterator.java#L73 >>> ) >>> >>> Please note that the ArrowVectorIterator is our encouraged way of using >>> the JDBC adapter. >>> >>> Best, >>> Liya Fan >>> >>> >>> On Wed, May 5, 2021 at 1:42 PM Micah Kornfield >>> wrote: >>> >>>> I would need to look further, but I thought we handled null vs not >>>> null. At least I thought we had specialized conversion code to avoid >>>> branches. If this isn't the case it seems reasonable to contribute a path. >>>> >>>> On Tue, May 4, 2021 at 3:39 AM Joris Peeters < >>>> joris.mg.peeters@gmail.com> wrote: >>>> >>>>> I'm looking to use the Java JDBC adapter for loading tables from SQL >>>>> Server into Arrow record batches. >>>>> >>>>> At first glance the Arrow JDBC adapter seems to work well but, unless >>>>> I'm mistaken, it simply makes every vector nullable, irrespective of >>>>> whether the corresponding SQL column is nullable or not. >>>>> >>>>> I think the line >>>>> >>>>> final FieldType fieldType = new FieldType(true, arrowType, /* >>>>> dictionary encoding */ null, metadata); >>>>> >>>>> in >>>>> https://github.com/apache/arrow/blob/master/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowUtils.java#L158 >>>>> might be the cause here. >>>>> >>>>> Is my interpretation correct, or am I missing a setting of sorts? If >>>>> indeed correct, is there a fundamental reason the NULL-ness is not >>>>> transferred, or is this something I could contribute in a PR? (which I'd be >>>>> happy to) I guess it's just a matter of inspecting the result metadata. >>>>> >>>>> Cheers, >>>>> -J >>>>> >>>> --000000000000dae53505c1b8be79 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Fair enough.=C2=A0
I have this data moving through a f= ew different servers and clients, in IPC streaming format, consumed on vari= ous platforms/languages. The nullability in the schema is often used in &qu= ot;language-friendly" clients, e.g. to build a `std::vector<bool>= ;` or `std::vector<std::optional<bool>>` depending on whether t= he bit column is nullable, so preserving this information is quite importan= t, even if locally in Java it makes little difference.=C2=A0

=
I've worked around it for now by fudging the VectorSchemaRoo= t's schema myself, but I'll open a JIRA to track, and I'll assi= gn it to myself and provide a fix.

Cheers!
-Joris.


On Fri, May 7, 2021 at 3:22 AM Fan Liya <= liya.fan03@gmail.com> wrote:=
Hi Joris,

I think you are right.

<= div>We only use the nullability information in the consumers, because it ma= kes a difference in performance.

The nullability i= nformation in the schema is not accurate, as you have observed.=C2=A0
=
However, such information is not well-used in the Java implementation = (IMHO). For example, the validity buffer is allocated even if the vector is= non-nullable.

That said, I think it would be bett= er to keep the nullability information in sync.
So maybe we can o= pen a JIRA to track it?

Best,
Liya Fan
=C2=A0

On Thu, May 6, 2021 at 3:09 PM Joris Peeters <joris.mg.peeters= @gmail.com> wrote:
Hello Fan,

Yes, but it seems= that code path only affects the consumers, and whether they set a value in= the vector or not, see e.g.=C2=A0https://github.com/apa= che/arrow/blob/master/java/adapter/jdbc/src/main/java/org/apache/arrow/adap= ter/jdbc/consumer/DoubleConsumer.java#L57
However, the Vector= SchemaRoot's schema, defined I believe at=C2=A0https://g= ithub.com/apache/arrow/blob/master/java/adapter/jdbc/src/main/java/org/apac= he/arrow/adapter/jdbc/ArrowVectorIterator.java#L59, does not appear to = use this info, and just sets every column's nullability to true (as per= the link in my original email).

Note that we are = indeed using the ArrowVectorIterator, and it's when iterating over the = iterator and inspecting the schema of the elements (VectorSchemaRoot) that = I notice this.
Maybe all this needs is a `isColumnNullable(i, ..)= ` instead of `true` in `final FieldType fieldType =3D new FieldType(true, a= rrowType, /* dictionary encoding */ null, metadata);`.

Cheers,
-J

On Thu, May 6, 2021 at 5:53 AM Fan Liya = <liya.fan03@gm= ail.com> wrote:
Hi Joris,

Thanks for reporting t= he problem.

We make use of the nullable informatio= n in=C2=A0ArrowVectorIterator#initialize. (Details can be found in https://github.com/apache/arrow/blob/master/java/adapter/jdbc/src/m= ain/java/org/apache/arrow/adapter/jdbc/ArrowVectorIterator.java#L73)

Please note that the=C2=A0 ArrowVectorIterator is our encouraged way of using the JDBC adapter.
<= div>
Best,
Liya Fan


<= div class=3D"gmail_quote">
On Wed, May= 5, 2021 at 1:42 PM Micah Kornfield <emkornfield@gmail.com> wrote:
I would need = to look further, but I thought we handled null vs not null.=C2=A0 At least = I thought we had specialized conversion code to avoid branches.=C2=A0 If th= is isn't the case it seems reasonable to contribute a path.

On Tue, May = 4, 2021 at 3:39 AM Joris Peeters <joris.mg.peeters@gmail.com> wrote:
I'= m looking to use the Java JDBC adapter for loading tables from SQL Server i= nto Arrow record batches.=C2=A0

At first glance the Arro= w JDBC adapter seems to work well but, unless I'm mistaken, it simply m= akes every vector nullable, irrespective of whether the corresponding SQL c= olumn is nullable or not.=C2=A0

I think the line= =C2=A0

final FieldType fieldType =3D new FieldType= (true, arrowType, /* dictionary encoding */ null, metadata);

=

Is my i= nterpretation correct, or am I missing a setting of sorts? If indeed correc= t, is there a fundamental reason the NULL-ness is not transferred, or is th= is something I could contribute in a PR? (which I'd be happy to) I gues= s it's just a matter of inspecting the result metadata.

<= /div>
Cheers,
-J
--000000000000dae53505c1b8be79--