From: Micah Kornfield <emkornfield@gmail.com>
Date: Wed, 14 Apr 2021 20:05:44 -0700
Subject: Re: [Cython] Getting and Comparing String Scalars?
To: Weston Pace <weston.pace@gmail.com>
Cc: user@arrow.apache.org

One more thought: at 7TB, your data starts entering the realm of "big". You
might consider doing the grouping with a distributed compute framework and
then outputting to Plasma.

On Wed, Apr 14, 2021 at 7:39 PM Micah Kornfield <emkornfield@gmail.com> wrote:

> +1 to everything Weston said.
>
> From your comments about Gandiva, it sounded like you were OK with the
> filter-based approach, but maybe you had a different idea of how to use
> Gandiva?
>
> I believe "filter" also has optimizations if your data is already mostly
> grouped by name.
>
> I agree that, algorithmically, the map approach is probably optimal, but
> as Weston alluded to, there are hidden constant overheads that might even
> things out between the different approaches.
>
> Also, if your data is already grouped, using a dataset expression might
> be fairly efficient, since it does row-group pruning based on predicates
> on the underlying Parquet data.
>
> -Micah
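> A rough sketch of that dataset-expression approach in Python might look
> like the following (a sketch only: the path, the `name` column, and the
> `process()` handler are placeholders, and it assumes `to_batches` accepts
> a `filter` expression the same way `to_table` does):
> ```
> import pyarrow.dataset as ds
>
> dataset = ds.dataset("/path/to/parquet_dir", format="parquet")
> # The filter is applied during the scan, so Parquet row groups whose
> # statistics rule out name == "Xander" can be skipped, and the whole
> # dataset is never materialized in memory at once.
> for batch in dataset.to_batches(filter=ds.field("name") == "Xander"):
>     process(batch)  # hypothetical per-batch handler
> ```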
> On Wed, Apr 14, 2021 at 7:13 PM Weston Pace <weston.pace@gmail.com> wrote:
>
>> Correct, the "group by" operation you're looking for doesn't quite exist
>> (externally) yet (others can correct me if I'm wrong here). ARROW-3978
>> sometimes gets brought up in reference to this. There are some things
>> (e.g. the C++ query execution engine) in the works which would provide
>> this. There is also an internal implementation
>> (arrow::compute::internal::Grouper) that is used for computing
>> partitions, but I believe it was intentionally kept internal; others may
>> be able to explain the reasoning further.
>>
>> Expressions are (or will soon be) built on compute, so using them can't
>> provide much benefit over what is in compute. I want to say the best
>> approach you could get with what is in compute for 4.0.0 is
>> O(num_rows * num_unique_names). To create the mask you would use the
>> equals function. So the whole operation would be...
>>
>> 1) Use unique to get the possible string values
>> 2) For each string value
>>   a) Use equals to get a mask
>>   b) Use filter to get a subarray
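>>
>> In pyarrow that might look roughly like this (a sketch only, assuming
>> each RecordBatch has a string column called `name` and that the `filter`
>> kernel accepts a whole record batch):
>> ```
>> import pyarrow as pa
>> import pyarrow.compute as pc
>>
>> def split_by_name(batch: pa.RecordBatch) -> dict:
>>     names = batch.column(batch.schema.get_field_index("name"))
>>     groups = {}
>>     for name in pc.unique(names):         # 1) possible string values
>>         mask = pc.equal(names, name)      # 2a) boolean mask for this value
>>         groups[name.as_py()] = pc.filter(batch, mask)  # 2b) matching rows
>>     return groups
>> ```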
>>
>> So what you have may be a pretty reasonable workaround. I'd recommend
>> comparing it with what you get from compute, just for the sake of
>> comparison.
>>
>> There are also a few minor optimizations you can make that shouldn't be
>> too much harder. You want to avoid GetScalar if you can, as it will make
>> an allocation / copy for every item you access. Grab the column from the
>> record batch and cast it to the appropriate typed array (this is only
>> easy because it appears you have a fairly rigid schema). This will allow
>> you to access values directly without wrapping them in a scalar. For
>> example, in C++ (I'll leave the Cython to you :)) it would look like...
>>
>>   auto arr = std::dynamic_pointer_cast<arrow::DoubleArray>(record_batch->column(0));
>>   std::cout << arr->Value(0) << std::endl;
>>
>> For the string array I believe it is...
>>
>>   auto str_arr = std::dynamic_pointer_cast<arrow::StringArray>(record_batch->column(0));
>>   arrow::util::string_view view = str_arr->GetView(0);
>>
>> It may take a slight bit of finesse to figure out how to get
>> arrow::util::string_view to work with map, but it should be doable.
>> There is also GetString, which returns std::string and should only be
>> slightly more expensive, and GetValue, which returns a uint8_t* and
>> writes the length into an out parameter.
>>
>> On Wed, Apr 14, 2021 at 3:15 PM Xander Dunn <xander@xander.ai> wrote:
>>
>>> Thanks, I did try a few things with pyarrow.compute. However, the
>>> pyarrow.compute.filter interface indicates that it takes a boolean mask
>>> to do the filtering:
>>> https://arrow.apache.org/docs/python/generated/pyarrow.compute.filter.html
>>>
>>> But it doesn't actually help me create the mask. I'm back to iterating
>>> through the rows, and now I would need to create a boolean array of
>>> size (num_rows) for every unique value of `name`.
>>>
>>> I saw in the dataset docs
>>> (https://arrow.apache.org/docs/python/dataset.html) some discussion of
>>> Expressions, such as `ds.field("name") == "Xander"`. However, I don't
>>> see a way of computing such an expression without loading the entire
>>> dataset into memory with `dataset.to_table()`, which doesn't work for
>>> my dataset because it's many times larger than RAM. Can an Expression
>>> be computed on a RecordBatch?
>>>
>>> It's also hard to foresee how applying filter for each unique value of
>>> `name` would be more computationally efficient. The loop I posted above
>>> is O(num_rows), whereas applying filter for each name would be
>>> O(num_rows * num_unique_names). It could still be faster if my loop
>>> code is poorly implemented or if filter is multi-threaded.
>>>
>>> On Wed, Apr 14, 2021 at 4:45 PM, Micah Kornfield <emkornfield@gmail.com> wrote:
>>>
>>>> Have you looked at the pyarrow compute functions [1][2]?
>>>>
>>>> Unique and filter seem like they would help.
>>>>
>>>> [1] https://arrow.apache.org/docs/python/compute.html?highlight=pyarrow%20compute
>>>> [2] https://arrow.apache.org/docs/cpp/compute.html#compute-function-list
>>>>
>>>> On Wed, Apr 14, 2021 at 2:02 PM Xander Dunn <xander@xander.ai> wrote:
>>>>
>>>>> Thanks Weston,
>>>>>
>>>>> Performance is paramount here; I'm streaming through 7TB of data.
>>>>>
>>>>> I actually need to separate the data based on the value of the `name`
>>>>> column. For every unique value of `name`, I need a batch of those
>>>>> rows. I tried using Gandiva's filter function but can't get Gandiva
>>>>> installed on Ubuntu (see my earlier thread "[Python] pyarrow.gandiva
>>>>> unavailable on Ubuntu?" on this mailing list).
>>>>>
>>>>> Aside from that, I'm not sure of a way to separate the data faster
>>>>> than iterating through every row and placing the values into a map
>>>>> keyed on `name`:
>>>>> ```
>>>>> cdef struct myUpdateStruct:
>>>>>     double value
>>>>>     int64_t checksum
>>>>>
>>>>> cdef iterate_dataset():
>>>>>     cdef map[c_string, deque[myUpdateStruct]] myUpdates
>>>>>     cdef shared_ptr[CRecordBatch] batch  # This is populated by a scanner of .parquet files
>>>>>     cdef int64_t batch_row_index = 0
>>>>>     while batch_row_index < batch.get().num_rows():
>>>>>         name_buffer = (<CBaseBinaryScalar*>GetResultValue(names.get().\
>>>>>             GetScalar(batch_row_index)).get()).value
>>>>>         name = <char *>name_buffer.get().data()
>>>>>         value = (<CDoubleScalar*>GetResultValue(values.get().\
>>>>>             GetScalar(batch_row_index)).get()).value
>>>>>         checksum = (<CInt64Scalar*>GetResultValue(checksums.get().\
>>>>>             GetScalar(batch_row_index)).get()).value
>>>>>         newUpdate = myUpdateStruct(value, checksum)
>>>>>         if myUpdates.count(name) <= 0:
>>>>>             myUpdates[name] = deque[myUpdateStruct]()
>>>>>         myUpdates[name].push_front(newUpdate)
>>>>>         if myUpdates[name].size() > 1024:
>>>>>             myUpdates[name].pop_back()
>>>>>         batch_row_index += 1
>>>>> ```
>>>>> This takes 107 minutes to iterate through the first 290GB of data.
>>>>> Without accessing or filtering the data in any way, it takes only
>>>>> 12 minutes to read all the .parquet files into RecordBatches and
>>>>> place them into Plasma.
>>>>>
>>>>> On Wed, Apr 14, 2021 at 12:57 PM, Weston Pace <weston.pace@gmail.com> wrote:
>>>>>
>>>>>> If you don't need the performance, you could stay in Python (use
>>>>>> to_pylist() for the array or as_py() for scalars).
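>>>>>>
>>>>>> For instance (a hypothetical sketch of that pure-Python path,
>>>>>> assuming a Python-level pyarrow.RecordBatch whose `name` column is
>>>>>> at index 0):
>>>>>> ```
>>>>>> names = batch.column(0)
>>>>>> for i in range(batch.num_rows):
>>>>>>     name = names[i].as_py()  # Arrow scalar -> Python str (or None)
>>>>>>     if name == "Xander":
>>>>>>         print("found it")
>>>>>> ```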
>>>>>>
>>>>>> If you do need the performance, then you're probably better served
>>>>>> getting the buffers and operating on them directly. Or, even better,
>>>>>> making use of the compute kernels:
>>>>>>
>>>>>> import pyarrow as pa
>>>>>> import pyarrow.compute as pc
>>>>>>
>>>>>> arr = pa.array(['abc', 'ab', 'Xander', None], pa.string())
>>>>>> desired = pa.array(['Xander'], pa.string())
>>>>>> pc.any(pc.is_in(arr, value_set=desired)).as_py()  # True
>>>>>>
>>>>>> On Wed, Apr 14, 2021 at 6:29 AM Xander Dunn <xander@xander.ai> wrote:
>>>>>>
>>>>>>> This works for getting a C string out of the CScalar:
>>>>>>> ```
>>>>>>> name_buffer = (<CBaseBinaryScalar*>GetResultValue(names.get().\
>>>>>>>     GetScalar(batch_row_index)).get()).value
>>>>>>> name = <char *>name_buffer.get().data()
>>>>>>> ```
>>>>>>>
>>>>>>> On Tue, Apr 13, 2021 at 10:43 PM, Xander Dunn <xander@xander.ai> wrote:
>>>>>>>
>>>>>>>> Here is an example code snippet from a .pyx file that successfully
>>>>>>>> iterates through a CRecordBatch and ensures that the timestamps
>>>>>>>> are ascending:
>>>>>>>> ```
>>>>>>>> while batch_row_index < batch.get().num_rows():
>>>>>>>>     timestamp = GetResultValue(times.get().GetScalar(batch_row_index))
>>>>>>>>     new_timestamp = <CTimestampScalar*>timestamp.get()
>>>>>>>>     current_timestamp = timestamps[name]
>>>>>>>>     if current_timestamp > new_timestamp.value:
>>>>>>>>         abort()
>>>>>>>>     batch_row_index += 1
>>>>>>>> ```
>>>>>>>>
>>>>>>>> However, I'm having difficulty operating on the values in a column
>>>>>>>> of string type. Unlike CTimestampScalar, there is no CStringScalar.
>>>>>>>> Although there is a StringScalar type in C++, it isn't defined in
>>>>>>>> the Cython interface. There is a `CStringType` and a `c_string`
>>>>>>>> type.
>>>>>>>> ```
>>>>>>>> while batch_row_index < batch.get().num_rows():
>>>>>>>>     name = GetResultValue(names.get().GetScalar(batch_row_index))
>>>>>>>>     name_string = <CStringType*>name.get()  # This is wrong
>>>>>>>>     printf("%s\n", name_string)  # This prints garbage
>>>>>>>>     if name_string == b"Xander":  # Doesn't work
>>>>>>>>         print("found it")
>>>>>>>>     batch_row_index += 1
>>>>>>>> ```
>>>>>>>> How do I get the string value as a C type and compare it to other
>>>>>>>> strings?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Xander