From: Weston Pace
Date: Wed, 14 Apr 2021 09:57:03 -1000
Subject: Re: [Cython] Getting and Comparing String Scalars?
To: user@arrow.apache.org

If you don't need the performance, you could stay in Python (use to_pylist() for the array or as_py() for scalars).

If you do need the performance, then you're probably better served by getting the buffers and operating on them directly. Or, even better, make use of the compute kernels:

    import pyarrow as pa
    import pyarrow.compute as pc

    arr = pa.array(['abc', 'ab', 'Xander', None], pa.string())
    desired = pa.array(['Xander'], pa.string())
    pc.any(pc.is_in(arr, value_set=desired)).as_py()  # True

On Wed, Apr 14, 2021 at 6:29 AM Xander Dunn <xander@xander.ai> wrote:

> This works for getting a C string out of the CScalar:
> ```
> name_buffer = (<CBaseBinaryScalar*>GetResultValue(names.get().\
>                        GetScalar(batch_row_index)).get()).value
> name = <char *>name_buffer.get().data()
> ```
>
> On Tue, Apr 13, 2021 at 10:43 PM, Xander Dunn <xander@xander.ai> wrote:
>
>> Here is an example code snippet from a .pyx file that successfully
>> iterates through a CRecordBatch and ensures that the timestamps are
>> ascending:
>> ```
>> while batch_row_index < batch.get().num_rows():
>>     timestamp = GetResultValue(times.get().GetScalar(batch_row_index))
>>     new_timestamp = <CTimestampScalar*>timestamp.get()
>>     current_timestamp = timestamps[name]
>>     if current_timestamp > new_timestamp.value:
>>         abort()
>>     batch_row_index += 1
>> ```
>>
>> However, I'm having difficulty operating on the values in a column of
>> string type. Unlike CTimestampScalar, there is no CStringScalar. Although
>> there is a StringScalar type in C++, it isn't defined in the Cython
>> interface. There is a `CStringType` and a `c_string` type.
>> ```
>> while batch_row_index < batch.get().num_rows():
>>     name = GetResultValue(names.get().GetScalar(batch_row_index))
>>     name_string = <CStringType*>name.get() # This is wrong
>>     printf("%s\n", name_string) # This prints garbage
>>     if name_string == b"Xander": # Doesn't work
>>         print("found it")
>>     batch_row_index += 1
>> ```
>> How do I get the string value as a C type and compare it to other
>> strings?
>>
>> Thanks,
>> Xander
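
For anyone taking the "operate on the buffers directly" route, it may help to picture how Arrow lays out a string column: a validity bitmap, a buffer of int32 offsets, and one contiguous data buffer holding the UTF-8 bytes. The values are not NUL-terminated, so a raw `char*` into the data needs a length alongside it. The sketch below only imitates that layout in plain Python with the standard library; it does not call pyarrow, and the helper names `build_string_buffers`, `value_at` and `contains` are hypothetical, made up for illustration. With a real pyarrow string array, `arr.buffers()` should hand back the validity, offsets and data buffers in that order.

```python
import struct


def build_string_buffers(values):
    """Pack a list of str/None into Arrow-style (validity, offsets, data) bytes.

    Hypothetical helper: it mimics the layout only, it is not a pyarrow API.
    """
    validity = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            validity[i // 8] |= 1 << (i % 8)  # bit i set means "not null" (LSB first)
            data += v.encode("utf-8")
        offsets.append(len(data))  # a null repeats the previous offset
    packed_offsets = struct.pack(f"<{len(offsets)}i", *offsets)
    return bytes(validity), packed_offsets, bytes(data)


def value_at(buffers, i):
    """Return the bytes of element i, or None if the validity bit is unset."""
    validity, offsets, data = buffers
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    # Element i spans data[offsets[i]:offsets[i + 1]]; there is no terminator.
    start, end = struct.unpack_from("<2i", offsets, 4 * i)
    return data[start:end]


def contains(buffers, length, needle):
    """Linear scan comparing each non-null element with `needle` (bytes)."""
    for i in range(length):
        v = value_at(buffers, i)
        if v is not None and v == needle:
            return True
    return False


values = ["abc", "ab", "Xander", None]
buffers = build_string_buffers(values)
print(contains(buffers, len(values), b"Xander"))  # True
```

The same offset arithmetic is what a Cython loop over the real buffers would do, with the comparison taking the element length from the two offsets rather than relying on a terminator. In most cases the compute kernels mentioned above are still the simpler choice.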