From: Micah Kornfield <emkornfield@gmail.com>
Date: Wed, 14 Apr 2021 19:39:59 -0700
Subject: Re: [Cython] Getting and Comparing String Scalars?
To: Weston Pace <weston.pace@gmail.com>
Cc: user@arrow.apache.org

+1 to everything Weston said.

From your comments about Gandiva, it sounded like you were OK with the
filter-based approach, but maybe you had a different idea of how to use
Gandiva?

I believe "filter" also has optimizations if your data is already mostly
grouped by name.

I agree that, algorithmically, the map approach is probably optimal, but as
Weston alluded to, there are hidden constant overheads that might even
things out between the different approaches.

Also, if your data is already grouped, using a dataset expression might be
fairly efficient, since it does row group pruning based on predicates on
the underlying parquet data.
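A rough, untested sketch of what that could look like with the dataset API
(the "data/" path and "name" column here are placeholders):
```
import pyarrow.dataset as ds

# Placeholder path and column name, for illustration only.
dataset = ds.dataset("data/", format="parquet")

# The filter expression is pushed down to the parquet reader, so row groups
# whose statistics rule out name == "Xander" can be skipped without being
# read into memory.
for batch in dataset.to_batches(filter=ds.field("name") == "Xander"):
    ...  # each RecordBatch holds only rows where name == "Xander"
```
If the filtered subset fits in memory, the same expression can also be
passed to dataset.to_table(filter=...).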
-Micah

On Wed, Apr 14, 2021 at 7:13 PM Weston Pace <weston.pace@gmail.com> wrote:

> Correct, the "group by" operation you're looking for doesn't quite exist
> (externally) yet (others can correct me if I'm wrong here). ARROW-3978
> sometimes gets brought up in reference to this. There are some things
> (e.g. the C++ query execution engine) in the works which would provide
> this. There is also an internal implementation
> (arrow::compute::internal::Grouper) that is used for computing partitions,
> but I believe it was intentionally kept internal; others may be able to
> explain the reasons further.
>
> Expressions are (or will soon be) built on compute, so using them is
> unlikely to provide much benefit over what is already in compute. I want
> to say the best approach available in compute for 4.0.0 is
> O(num_rows * num_unique_names). To create the mask you would use the
> equals function. So the whole operation would be:
>
> 1) Use unique to get the possible string values
> 2) For each string value
>   a) Use equals to get a mask
>   b) Use filter to get a subarray
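Those steps in pyarrow would look roughly like the following (untested
sketch; `names` and `values` are stand-in arrays rather than columns pulled
from a real RecordBatch):
```
import pyarrow as pa
import pyarrow.compute as pc

# Stand-in data; in practice these would be columns from each RecordBatch,
# e.g. batch.column("name") and batch.column("value").
names = pa.array(["a", "b", "a", "c", "b"], pa.string())
values = pa.array([1.0, 2.0, 3.0, 4.0, 5.0], pa.float64())

groups = {}
for name in pc.unique(names):        # 1) the possible string values
    mask = pc.equal(names, name)     # 2a) boolean mask for this value
    groups[name.as_py()] = values.filter(mask)  # 2b) matching subarray
```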
>
> So what you have may be a pretty reasonable workaround. I'd recommend
> comparing it with what you get from compute, just for reference.
>
> There are also a few minor optimizations you can make that shouldn't be
> too much harder. You want to avoid GetScalar if you can, as it will make
> an allocation / copy for every item you access. Grab the column from the
> record batch and cast it to the appropriate typed array (this is only easy
> because it appears you have a fairly rigid schema). This will allow you to
> access values directly without wrapping them in a scalar. For example, in
> C++ (I'll leave the Cython to you :)) it would look like:
>
> auto arr = std::dynamic_pointer_cast<arrow::DoubleArray>(record_batch->column(0));
> std::cout << arr->Value(0) << std::endl;
>
> For the string array I believe it is:
>
> auto str_arr = std::dynamic_pointer_cast<arrow::StringArray>(record_batch->column(0));
> arrow::util::string_view view = str_arr->GetView(0);
>
> It may take a slight bit of finesse to figure out how to get
> arrow::util::string_view to work with map, but it should be doable. There
> is also GetString, which returns std::string and should only be slightly
> more expensive, and GetValue, which returns a uint8_t* and writes the
> length into an out parameter.
>
> On Wed, Apr 14, 2021 at 3:15 PM Xander Dunn <xander@xander.ai> wrote:
>
>> Thanks, I did try a few things with pyarrow.compute. However, the
>> pyarrow.compute.filter interface indicates that it takes a boolean mask
>> to do the filtering:
>> https://arrow.apache.org/docs/python/generated/pyarrow.compute.filter.html
>>
>> But it doesn't actually help me create the mask. I'm back to iterating
>> through the rows, and now I would need to create a boolean array of size
>> (num_rows) for every unique value of `name`.
>>
>> I saw in the dataset docs (https://arrow.apache.org/docs/python/dataset.html)
>> some discussion of Expressions, such as `ds.field("name") == "Xander"`.
>> However, I don't see a way of computing such an expression without loading
>> the entire dataset into memory with `dataset.to_table()`, which doesn't
>> work for my dataset because it's many times larger than RAM. Can an
>> Expression be computed on a RecordBatch?
>>
>> It's also hard to foresee how applying filter for each unique value of
>> `name` would be more computationally efficient. The loop I posted above
>> is O(num_rows), whereas applying filter for each name would be
>> O(num_rows * num_unique_names). It could still be faster if my loop code
>> is poorly implemented or if filter is multi-threaded.
>>
>> On Wed, Apr 14, 2021 at 4:45 PM, Micah Kornfield <emkornfield@gmail.com> wrote:
>>
>>> Have you looked at the pyarrow compute functions [1][2]?
>>>
>>> Unique and filter seem like they would help.
>>>
>>> [1] https://arrow.apache.org/docs/python/compute.html?highlight=pyarrow%20compute
>>> [2] https://arrow.apache.org/docs/cpp/compute.html#compute-function-list
>>>
>>> On Wed, Apr 14, 2021 at 2:02 PM Xander Dunn <xander@xander.ai> wrote:
>>>
>>>> Thanks Weston,
>>>>
>>>> Performance is paramount here; I'm streaming through 7TB of data.
>>>>
>>>> I actually need to separate the data based on the value of the `name`
>>>> column. For every unique value of `name`, I need a batch of those rows.
>>>> I tried using Gandiva's filter function but can't get Gandiva installed
>>>> on Ubuntu (see my earlier thread "[Python] pyarrow.gandiva unavailable
>>>> on Ubuntu?" on this mailing list).
>>>>
>>>> Aside from that, I'm not sure of a way to separate the data faster than
>>>> iterating through every row and placing the values into a map keyed on
>>>> `name`:
>>>> ```
>>>> cdef struct myUpdateStruct:
>>>>     double value
>>>>     int64_t checksum
>>>>
>>>> cdef iterate_dataset():
>>>>     cdef map[c_string, deque[myUpdateStruct]] myUpdates
>>>>     cdef shared_ptr[CRecordBatch] batch  # Populated by a scanner of .parquet files
>>>>     cdef int64_t batch_row_index = 0
>>>>     while batch_row_index < batch.get().num_rows():
>>>>         name_buffer = (<CBaseBinaryScalar*>GetResultValue(names.get().\
>>>>             GetScalar(batch_row_index)).get()).value
>>>>         name = <char *>name_buffer.get().data()
>>>>         value = (<CDoubleScalar*>GetResultValue(values.get().\
>>>>             GetScalar(batch_row_index)).get()).value
>>>>         checksum = (<CInt64Scalar*>GetResultValue(checksums.get().\
>>>>             GetScalar(batch_row_index)).get()).value
>>>>         newUpdate = myUpdateStruct(value, checksum)
>>>>         if myUpdates.count(name) <= 0:
>>>>             myUpdates[name] = deque[myUpdateStruct]()
>>>>         myUpdates[name].push_front(newUpdate)
>>>>         if myUpdates[name].size() > 1024:
>>>>             myUpdates[name].pop_back()
>>>>         batch_row_index += 1
>>>> ```
>>>> This takes 107 minutes to iterate through the first 290GB of data.
>>>> Without accessing or filtering the data in any way, it takes only 12
>>>> minutes to read all the .parquet files into RecordBatches and place
>>>> them into Plasma.
>>>>
>>>> On Wed, Apr 14, 2021 at 12:57 PM, Weston Pace <weston.pace@gmail.com> wrote:
>>>>
>>>>> If you don't need the performance, you could stay in Python (use
>>>>> to_pylist() for the array or as_py() for scalars).
>>>>>
>>>>> If you do need the performance, then you're probably better served
>>>>> getting the buffers and operating on them directly. Or, even better,
>>>>> making use of the compute kernels:
>>>>>
>>>>> arr = pa.array(['abc', 'ab', 'Xander', None], pa.string())
>>>>> desired = pa.array(['Xander'], pa.string())
>>>>> pc.any(pc.is_in(arr, value_set=desired)).as_py() # True
>>>>>
>>>>> On Wed, Apr 14, 2021 at 6:29 AM Xander Dunn <xander@xander.ai> wrote:
>>>>>
>>>>>> This works for getting a C string out of the CScalar:
>>>>>> ```
>>>>>> name_buffer = (<CBaseBinaryScalar*>GetResultValue(names.get().\
>>>>>>     GetScalar(batch_row_index)).get()).value
>>>>>> name = <char *>name_buffer.get().data()
>>>>>> ```
>>>>>>
>>>>>> On Tue, Apr 13, 2021 at 10:43 PM, Xander Dunn <xander@xander.ai> wrote:
>>>>>>
>>>>>>> Here is an example code snippet from a .pyx file that successfully
>>>>>>> iterates through a CRecordBatch and ensures that the timestamps are
>>>>>>> ascending:
>>>>>>> ```
>>>>>>> while batch_row_index < batch.get().num_rows():
>>>>>>>     timestamp = GetResultValue(times.get().GetScalar(batch_row_index))
>>>>>>>     new_timestamp = <CTimestampScalar*>timestamp.get()
>>>>>>>     current_timestamp = timestamps[name]
>>>>>>>     if current_timestamp > new_timestamp.value:
>>>>>>>         abort()
>>>>>>>     batch_row_index += 1
>>>>>>> ```
>>>>>>>
>>>>>>> However, I'm having difficulty operating on the values in a column
>>>>>>> of string type. Unlike CTimestampScalar, there is no CStringScalar.
>>>>>>> Although there is a StringScalar type in C++, it isn't defined in the
>>>>>>> Cython interface. There is a `CStringType` and a `c_string` type.
>>>>>>> ```
>>>>>>> while batch_row_index < batch.get().num_rows():
>>>>>>>     name = GetResultValue(names.get().GetScalar(batch_row_index))
>>>>>>>     name_string = <CStringType*>name.get() # This is wrong
>>>>>>>     printf("%s\n", name_string) # This prints garbage
>>>>>>>     if name_string == b"Xander": # Doesn't work
>>>>>>>         print("found it")
>>>>>>>     batch_row_index += 1
>>>>>>> ```
>>>>>>> How do I get the string value as a C type and compare it to other
>>>>>>> strings?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Xander

From = your comments about Gandiva, it sounded like you were OK with the filter ba= sed approach but maybe you had a different idea of using Gandiva?

I believe "filter" also has optimizations if your= data was already mostly grouped by name.=C2=A0

I = agree algorithmically, the map approach is probably optimal but as Weston a= lluded to there are hidden constant overheads that might even out with diff= erent approaches.

Also if your data is already gro= uped using the dataset expression might be fairly efficient since it does r= ow group pruning based on predicates on the underlying parquet data.
<= div>
-Micah

On Wed, Apr 14, 2021 at 7:13 PM Weston Pace = <weston.pace@gmail.com> = wrote:
Correct, the "group by" operation you're lookin= g for doesn't quite exist (externally) yet (others can correct me if I&= #39;m wrong here). ARROW-3978 sometimes gets brought up = in reference to this.=C2=A0 There are some things (e.g. C++ query execution engine) in the w= orks which would provide this.=C2=A0 There is also an internal implementati= on (arrow::compute::internal::Grouper) that is used for computing partition= s but I believe it was intentionally kept internal, others may be able to e= xplain more the reason.

Expressions are (or will s= oon be) built on compute so using them is unable to provide much benefit ov= er what is in compute.=C2=A0 I want to say the best approach you could get = in what is in compute for 4.0.0 is O(num_rows * num_unique_names).=C2=A0 To= create the mask you would use the equals function.=C2=A0 So the whole oper= ation would be...

1) Use unique to get the possibl= e string values
2) For each string value
=C2=A0 a) Use = equals to get a mask
=C2=A0 b) Use filter to get a subarray

So what you have may be a pretty reasonable workaro= und.=C2=A0 I'd recommend comparing with what you get from compute just = for the sake of comparison.

So there are a few min= or optimizations you can make that shouldn't be too much harder.=C2=A0 = You want to avoid GetScalar if you can as it will make an allocation / copy= for every item you access.=C2=A0 Grab the column from the record batch and= cast it to the appropriate typed array (this is only easy because it appea= rs you have a fairly rigid schema).=C2=A0 This will allow you to access val= ues directly without wrapping them in a scalar.=C2=A0 For example, in C++ (= I'll leave the cython to you :)) it would look like...

=C2=A0 auto arr =3D std::dynamic_pointer_cast<arrow::DoubleArra= y>(record_batch->column(0));
=C2=A0 std::cout << arr->Val= ue(0) << std::endl;

For the string array I b= elieve it is...

auto str_arr =3D std::dynamic_poin= ter_cast<arrow::StringArray>(record_batch->column(0));
arrow::u= til::string_view view =3D arr->GetView(0);

It m= ay take a slight bit of finesse to figure out how to get arrow::util::strin= g_view to work with map but it should be doable.=C2=A0 There is also GetStr= ing which returns std::string which should only be slightly more expensive = and GetValue which returns a uint8_t* and writes the length into an out par= ameter.

On Wed, Apr 14, 2021 at 3:15 PM Xander Dunn <xander@xander.ai> wrote:=
=
Thanks, I did try a few things with pyarrow.compute. However, the= pyarrow.compute.filter interface indicates that it takes a boolean mask to= do the filtering: https://arrow.apache.org/d= ocs/python/generated/pyarrow.compute.filter.html

But it doesn't actually help me create the mask? I'm back to= iterating through the rows and now I would need to create a boolean array = of size (num_rows) for every unique value of `name`.

I saw in the dataset docs (https://arrow.apache.org/docs/python= /dataset.html) some discussion on Expressions, such as `ds.field("= name") =3D=3D "Xander"`. However, I don't see a way of c= omputing such an expression without loading the entire dataset into memory = with `dataset.to_table()`, which doesn't work for my dataset because it= 's many times larger than RAM. Can an Expression be computed on a Recor= dBatch?

But it's also hard to foresee how = applying filter for each unique value of `name` will be more computationall= y efficient. The loop I posted above is O(num_rows), whereas applying filte= r for each name would be O(num_rows * num_unique_names). It could still be = faster if my loop code is poorly implemented or if filter is multi-threaded= .
3D""<= /div>

On Wed, Apr 1= 4, 2021 at 4:45 PM, Micah Kornfield <emkornfield@gmail.com> wrote:
Have you looked at the pyarrow compute functions [1][2]?=C2=A0=C2=A0

On Wed, Apr 14, 2021 at 2:02 PM Xander Dunn <xander@xander.ai> wrote:
Thanks Weston,
Performance is paramount here, I'm streaming through 7TB of= data.

I actually need to separate the data ba= sed on the value of the `name` column. For every unique value of `name`, I = need a batch of those rows. I tried using gandiva's filter function but= can't get gandiva installed on Ubuntu (see my earlier thread "[P= ython] pyarrow.gandiva unavailable on Ubuntu?" on this mailing = list).=C2=A0

Aside from that, I'm not sure= of a way to separate the data faster than iterating through every row and = placing the values into a map keyed on `name`:
```
<= div>cdef struct myUpdateStruct:
=C2=A0=C2=A0=C2=A0 double val= ue
=C2=A0 =C2=A0 int64_t checksum

cdef iterate_dataset():
=C2=A0 =C2=A0 cdef map[c_string,= deque[myUpdateStruct]] myUpdates
=C2=A0 =C2=A0 cdef shared_p= tr[CRecordBatch] batch # This is populated by a scanner of .parquet files
=C2=A0 =C2=A0 cdef int64_t batch_row_index =3D 0
=C2= =A0=C2=A0=C2=A0 while batch_row_index < batch.get().num_rows():
=C2=A0 =C2=A0 =C2=A0 =C2=A0 name_buffer =3D (<CBaseBinaryScalar*&g= t;GetResultValue(names.get().\
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 GetScalar(batc= h_row_index)).get()).value
=C2=A0 =C2=A0 =C2=A0 =C2=A0 name = =3D <char *>name_buffer.get().data()
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 value =3D (<CDoubleScalar*>GetResultValue(values.get().\
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 GetScalar(batch_row_index)).get()).value
=
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 checksum =3D (<CIn= t64Scalar*>GetResultValue(checksums.get().\
=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0 GetScalar(batch_row_index)).get()).value
=C2=A0 =C2=A0 = =C2=A0 =C2=A0 newUpdate =3D myUpdateStruct(value, checksum)
= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if myUpdates.count(name) <=3D= 0:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 myUpdates[name]= =3D deque[myUpdateStruct]()
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0myUpdates[name].push_front(newUpdate)
=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if=C2=A0myUpdates[name].size() >= 1024:
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0myUpdates[name].pop_back()
=C2=A0 =C2=A0 = =C2=A0 =C2=A0 batch_row_index +=3D 1
```
This takes= 107minutes to iterate through the first 290GB of data. Without accessing o= r filtering the data in any way it takes only 12min to read all the .parque= t files into RecordBatches and place them into Plasma.
=

On Wed, Apr 14, 20= 21 at 12:57 PM, Weston Pace <weston.pace@g= mail.com> wrote:
If you don't nee= d the performance, you could stay in python (use to_pylist() for the array = or as_py() for scalars).

If you do need the perfor= mance then you're probably better served getting the buffers and operat= ing on them directly.=C2=A0 Or, even better, making use of the compute kern= els:

arr =3D pa.array(['abc', 'ab'= , 'Xander', None], pa.string())
desired =3D pa.array([= 9;Xander'], pa.string())
pc.any(pc.is_in(arr, value_set=3Ddes= ired)).as_py() # True

On Wed, Apr 14, 2021 at 6:29 AM Xander Dunn = <xander@xander.ai> wrote:
This works for getting= a c string out of the CScalar:
```
=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 name_buffer =3D (<CBaseBinaryS= calar*>GetResultValue(names.get().\
=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 GetScalar(batch_row_index)).get(= )).value
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 name =3D <char *>name_buffer.get().data()
```
=


On Tue,= Apr 13, 2021 at 10:43 PM, Xander Dunn <xander@= xander.ai> wrote:
Here is an example code snippet from a .pyx file that successfully itera= tes through a CRecordBatch and ensures that the timestamps are ascending:
```
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 while batch_row_index < batch.get().num_rows():=
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 timestamp =3D GetResultValue(times.get().Get= Scalar(batch_row_index))
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 new_timestamp =3D &l= t;CTimestampScalar*>timestamp.get()
=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 curre= nt_timestamp =3D timestamps[name]
=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if current_= timestamp > new_timestamp.value:
=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 abort()
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 batch_row_index +=3D= 1
```

However, I'm having d= ifficulty operating on the values in a column of string type. Unlike CTimes= tampScalar, there is no CStringScalar. Although there is a StringScalar typ= e in C++, it isn't defined in the Cython interface. There is a `CString= Type` and a `c_string` type.
```
=C2=A0=C2=A0= =C2=A0 while batch_row_index < batch.get().num_rows():
=C2= =A0 =C2=A0 =C2=A0 =C2=A0 name =3D GetResultValue(names.get().GetScalar(batc= h_row_index))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 name_string =3D <= ;CStringType*>name.get() # This is wrong
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 printf("%s\n", name_string) # This prints garbage
<= /div>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 if name_string =3D=3D b"Xander&q= uot;: # Doesn't work
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 print("found it")
=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 batch_row_index +=3D 1
```
H= ow do I get the string value as a C type and compare it to other strings?= =C2=A0

Thanks,
Xander
=
<= /div>

--00000000000036de5d05bff9c81a--