From: Micah Kornfield <emkornfield@gmail.com>
Reply-To: emkornfield@gmail.com
Date: Wed, 14 Apr 2021 16:45:50 -0700
Subject: Re: [Cython] Getting and Comparing String Scalars?
To: user@arrow.apache.org

Have you looked at the pyarrow compute functions [1][2]? Unique and filter
seem like they would help.

[1] https://arrow.apache.org/docs/python/compute.html?highlight=pyarrow%20compute
[2] https://arrow.apache.org/docs/cpp/compute.html#compute-function-list

On Wed, Apr 14, 2021 at 2:02 PM Xander Dunn <xander@xander.ai> wrote:

> Thanks Weston,
>
> Performance is paramount here, I'm streaming through 7TB of data.
>
> I actually need to separate the data based on the value of the `name`
> column. For every unique value of `name`, I need a batch of those rows. I
> tried using Gandiva's filter function but can't get Gandiva installed on
> Ubuntu (see my earlier thread "[Python] pyarrow.gandiva unavailable on
> Ubuntu?" on this mailing list).
>
> Aside from that, I'm not sure of a way to separate the data faster than
> iterating through every row and placing the values into a map keyed on
> `name`:
> ```
> cdef struct myUpdateStruct:
>     double value
>     int64_t checksum
>
> cdef iterate_dataset():
>     cdef map[c_string, deque[myUpdateStruct]] myUpdates
>     cdef shared_ptr[CRecordBatch] batch  # This is populated by a scanner of .parquet files
>     cdef int64_t batch_row_index = 0
>     while batch_row_index < batch.get().num_rows():
>         name_buffer = (<CBaseBinaryScalar*>GetResultValue(names.get().\
>             GetScalar(batch_row_index)).get()).value
>         name = <char *>name_buffer.get().data()
>         value = (<CDoubleScalar*>GetResultValue(values.get().\
>             GetScalar(batch_row_index)).get()).value
>         checksum = (<CInt64Scalar*>GetResultValue(checksums.get().\
>             GetScalar(batch_row_index)).get()).value
>         newUpdate = myUpdateStruct(value, checksum)
>         if myUpdates.count(name) <= 0:
>             myUpdates[name] = deque[myUpdateStruct]()
>         myUpdates[name].push_front(newUpdate)
>         if myUpdates[name].size() > 1024:
>             myUpdates[name].pop_back()
>         batch_row_index += 1
> ```
> This takes 107 minutes to iterate through the first 290GB of data. Without
> accessing or filtering the data in any way, it takes only 12 minutes to
> read all the .parquet files into RecordBatches and place them into Plasma.
>
>
> On Wed, Apr 14, 2021 at 12:57 PM, Weston Pace <weston.pace@gmail.com>
> wrote:
>
>> If you don't need the performance, you could stay in Python (use
>> to_pylist() for the array or as_py() for scalars).
>>
>> If you do need the performance then you're probably better served getting
>> the buffers and operating on them directly.
Or, even better, making use of
>> the compute kernels:
>>
>> arr = pa.array(['abc', 'ab', 'Xander', None], pa.string())
>> desired = pa.array(['Xander'], pa.string())
>> pc.any(pc.is_in(arr, value_set=desired)).as_py()  # True
>>
>> On Wed, Apr 14, 2021 at 6:29 AM Xander Dunn <xander@xander.ai> wrote:
>>
>>> This works for getting a C string out of the CScalar:
>>> ```
>>> name_buffer = (<CBaseBinaryScalar*>GetResultValue(names.get().\
>>>     GetScalar(batch_row_index)).get()).value
>>> name = <char *>name_buffer.get().data()
>>> ```
>>>
>>> On Tue, Apr 13, 2021 at 10:43 PM, Xander Dunn <xander@xander.ai> wrote:
>>>
>>>> Here is an example code snippet from a .pyx file that successfully
>>>> iterates through a CRecordBatch and ensures that the timestamps are
>>>> ascending:
>>>> ```
>>>> while batch_row_index < batch.get().num_rows():
>>>>     timestamp = GetResultValue(times.get().GetScalar(batch_row_index))
>>>>     new_timestamp = <CTimestampScalar*>timestamp.get()
>>>>     current_timestamp = timestamps[name]
>>>>     if current_timestamp > new_timestamp.value:
>>>>         abort()
>>>>     batch_row_index += 1
>>>> ```
>>>>
>>>> However, I'm having difficulty operating on the values in a column of
>>>> string type. Unlike CTimestampScalar, there is no CStringScalar.
>>>> Although there is a StringScalar type in C++, it isn't defined in the
>>>> Cython interface. There is a `CStringType` and a `c_string` type.
>>>> ```
>>>> while batch_row_index < batch.get().num_rows():
>>>>     name = GetResultValue(names.get().GetScalar(batch_row_index))
>>>>     name_string = <CStringType*>name.get()  # This is wrong
>>>>     printf("%s\n", name_string)  # This prints garbage
>>>>     if name_string == b"Xander":  # Doesn't work
>>>>         print("found it")
>>>>     batch_row_index += 1
>>>> ```
>>>> How do I get the string value as a C type and compare it to other
>>>> strings?
>>>>
>>>> Thanks,
>>>> Xander