From: Micah Kornfield <emkornfield@gmail.com>
Date: Wed, 14 Apr 2021 20:05:44 -0700
Subject: Re: [Cython] Getting and Comparing String Scalars?
To: Weston Pace <weston.pace@gmail.com>
Cc: user@arrow.apache.org

One more thought: at 7TB, your data starts entering the realm of "big". You
might consider doing the grouping with a distributed compute framework and
then outputting to Plasma.

On Wed, Apr 14, 2021 at 7:39 PM Micah Kornfield <emkornfield@gmail.com> wrote:

> +1 to everything Weston said.
>
> From your comments about Gandiva, it sounded like you were OK with the
> filter-based approach, but maybe you had a different idea of how to use
> Gandiva?
>
> I believe "filter" also has optimizations if your data is already mostly
> grouped by name.
>
> I agree that, algorithmically, the map approach is probably optimal, but
> as Weston alluded to, there are hidden constant overheads that might even
> things out between the different approaches.
>
> Also, if your data is already grouped, using a dataset expression might
> be fairly efficient, since it does row-group pruning based on predicates
> on the underlying Parquet data.
>
> -Micah
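> A rough sketch of that dataset-expression approach in Python might look
> like the following (a sketch only: the path, the `name` column, and the
> `process()` handler are placeholders, and it assumes `to_batches` accepts
> a `filter` expression the same way `to_table` does):
> ```
> import pyarrow.dataset as ds
>
> dataset = ds.dataset("/path/to/parquet_dir", format="parquet")
> # The filter is applied during the scan, so Parquet row groups whose
> # statistics rule out name == "Xander" can be skipped, and the whole
> # dataset is never materialized in memory at once.
> for batch in dataset.to_batches(filter=ds.field("name") == "Xander"):
>     process(batch)  # hypothetical per-batch handler
> ```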
> On Wed, Apr 14, 2021 at 7:13 PM Weston Pace <weston.pace@gmail.com> wrote:
>
>> Correct, the "group by" operation you're looking for doesn't quite exist
>> (externally) yet (others can correct me if I'm wrong here). ARROW-3978
>> sometimes gets brought up in reference to this. There are some things
>> (e.g. the C++ query execution engine) in the works which would provide
>> this. There is also an internal implementation
>> (arrow::compute::internal::Grouper) that is used for computing
>> partitions, but I believe it was intentionally kept internal; others may
>> be able to explain the reasoning further.
>>
>> Expressions are (or will soon be) built on compute, so using them can't
>> provide much benefit over what is in compute. I want to say the best
>> approach you could get with what is in compute for 4.0.0 is
>> O(num_rows * num_unique_names). To create the mask you would use the
>> equals function. So the whole operation would be...
>>
>> 1) Use unique to get the possible string values
>> 2) For each string value
>>   a) Use equals to get a mask
>>   b) Use filter to get a subarray
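>>
>> In pyarrow that might look roughly like this (a sketch only, assuming
>> each RecordBatch has a string column called `name` and that the `filter`
>> kernel accepts a whole record batch):
>> ```
>> import pyarrow as pa
>> import pyarrow.compute as pc
>>
>> def split_by_name(batch: pa.RecordBatch) -> dict:
>>     names = batch.column(batch.schema.get_field_index("name"))
>>     groups = {}
>>     for name in pc.unique(names):         # 1) possible string values
>>         mask = pc.equal(names, name)      # 2a) boolean mask for this value
>>         groups[name.as_py()] = pc.filter(batch, mask)  # 2b) matching rows
>>     return groups
>> ```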
>>
>> So what you have may be a pretty reasonable workaround. I'd recommend
>> comparing it with what you get from compute, just for the sake of
>> comparison.
>>
>> There are also a few minor optimizations you can make that shouldn't be
>> too much harder. You want to avoid GetScalar if you can, as it will make
>> an allocation / copy for every item you access. Grab the column from the
>> record batch and cast it to the appropriate typed array (this is only
>> easy because it appears you have a fairly rigid schema). This will allow
>> you to access values directly without wrapping them in a scalar. For
>> example, in C++ (I'll leave the Cython to you :)) it would look like...
>>
>>   auto arr = std::dynamic_pointer_cast<arrow::DoubleArray>(record_batch->column(0));
>>   std::cout << arr->Value(0) << std::endl;
>>
>> For the string array I believe it is...
>>
>>   auto str_arr = std::dynamic_pointer_cast<arrow::StringArray>(record_batch->column(0));
>>   arrow::util::string_view view = str_arr->GetView(0);
>>
>> It may take a slight bit of finesse to figure out how to get
>> arrow::util::string_view to work with map, but it should be doable.
>> There is also GetString, which returns std::string and should only be
>> slightly more expensive, and GetValue, which returns a uint8_t* and
>> writes the length into an out parameter.
>>
>> On Wed, Apr 14, 2021 at 3:15 PM Xander Dunn <xander@xander.ai> wrote:
>>
>>> Thanks, I did try a few things with pyarrow.compute. However, the
>>> pyarrow.compute.filter interface indicates that it takes a boolean mask
>>> to do the filtering:
>>> https://arrow.apache.org/docs/python/generated/pyarrow.compute.filter.html
>>>
>>> But it doesn't actually help me create the mask. I'm back to iterating
>>> through the rows, and now I would need to create a boolean array of
>>> size (num_rows) for every unique value of `name`.
>>>
>>> I saw in the dataset docs
>>> (https://arrow.apache.org/docs/python/dataset.html) some discussion of
>>> Expressions, such as `ds.field("name") == "Xander"`. However, I don't
>>> see a way of computing such an expression without loading the entire
>>> dataset into memory with `dataset.to_table()`, which doesn't work for
>>> my dataset because it's many times larger than RAM. Can an Expression
>>> be computed on a RecordBatch?
>>>
>>> It's also hard to foresee how applying filter for each unique value of
>>> `name` would be more computationally efficient. The loop I posted above
>>> is O(num_rows), whereas applying filter for each name would be
>>> O(num_rows * num_unique_names). It could still be faster if my loop
>>> code is poorly implemented or if filter is multi-threaded.
>>>
>>> On Wed, Apr 14, 2021 at 4:45 PM, Micah Kornfield <emkornfield@gmail.com> wrote:
>>>
>>>> Have you looked at the pyarrow compute functions [1][2]?
>>>>
>>>> Unique and filter seem like they would help.
>>>>
>>>> [1] https://arrow.apache.org/docs/python/compute.html?highlight=pyarrow%20compute
>>>> [2] https://arrow.apache.org/docs/cpp/compute.html#compute-function-list
>>>>
>>>> On Wed, Apr 14, 2021 at 2:02 PM Xander Dunn <xander@xander.ai> wrote:
>>>>
>>>>> Thanks Weston,
>>>>>
>>>>> Performance is paramount here; I'm streaming through 7TB of data.
>>>>>
>>>>> I actually need to separate the data based on the value of the `name`
>>>>> column. For every unique value of `name`, I need a batch of those
>>>>> rows. I tried using Gandiva's filter function but can't get Gandiva
>>>>> installed on Ubuntu (see my earlier thread "[Python] pyarrow.gandiva
>>>>> unavailable on Ubuntu?" on this mailing list).
>>>>>
>>>>> Aside from that, I'm not sure of a way to separate the data faster
>>>>> than iterating through every row and placing the values into a map
>>>>> keyed on `name`:
>>>>> ```
>>>>> cdef struct myUpdateStruct:
>>>>>     double value
>>>>>     int64_t checksum
>>>>>
>>>>> cdef iterate_dataset():
>>>>>     cdef map[c_string, deque[myUpdateStruct]] myUpdates
>>>>>     cdef shared_ptr[CRecordBatch] batch  # This is populated by a scanner of .parquet files
>>>>>     cdef int64_t batch_row_index = 0
>>>>>     while batch_row_index < batch.get().num_rows():
>>>>>         name_buffer = (<CBaseBinaryScalar*>GetResultValue(names.get().\
>>>>>             GetScalar(batch_row_index)).get()).value
>>>>>         name = <char *>name_buffer.get().data()
>>>>>         value = (<CDoubleScalar*>GetResultValue(values.get().\
>>>>>             GetScalar(batch_row_index)).get()).value
>>>>>         checksum = (<CInt64Scalar*>GetResultValue(checksums.get().\
>>>>>             GetScalar(batch_row_index)).get()).value
>>>>>         newUpdate = myUpdateStruct(value, checksum)
>>>>>         if myUpdates.count(name) <= 0:
>>>>>             myUpdates[name] = deque[myUpdateStruct]()
>>>>>         myUpdates[name].push_front(newUpdate)
>>>>>         if myUpdates[name].size() > 1024:
>>>>>             myUpdates[name].pop_back()
>>>>>         batch_row_index += 1
>>>>> ```
>>>>> This takes 107 minutes to iterate through the first 290GB of data.
>>>>> Without accessing or filtering the data in any way, it takes only
>>>>> 12 minutes to read all the .parquet files into RecordBatches and
>>>>> place them into Plasma.
>>>>>
>>>>> On Wed, Apr 14, 2021 at 12:57 PM, Weston Pace <weston.pace@gmail.com> wrote:
>>>>>
>>>>>> If you don't need the performance, you could stay in Python (use
>>>>>> to_pylist() for the array or as_py() for scalars).
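>>>>>>
>>>>>> For instance (a hypothetical sketch of that pure-Python path,
>>>>>> assuming a Python-level pyarrow.RecordBatch whose `name` column is
>>>>>> at index 0):
>>>>>> ```
>>>>>> names = batch.column(0)
>>>>>> for i in range(batch.num_rows):
>>>>>>     name = names[i].as_py()  # Arrow scalar -> Python str (or None)
>>>>>>     if name == "Xander":
>>>>>>         print("found it")
>>>>>> ```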
>>>>>>
>>>>>> If you do need the performance, then you're probably better served
>>>>>> getting the buffers and operating on them directly. Or, even better,
>>>>>> making use of the compute kernels:
>>>>>>
>>>>>> import pyarrow as pa
>>>>>> import pyarrow.compute as pc
>>>>>>
>>>>>> arr = pa.array(['abc', 'ab', 'Xander', None], pa.string())
>>>>>> desired = pa.array(['Xander'], pa.string())
>>>>>> pc.any(pc.is_in(arr, value_set=desired)).as_py()  # True
>>>>>>
>>>>>> On Wed, Apr 14, 2021 at 6:29 AM Xander Dunn <xander@xander.ai> wrote:
>>>>>>
>>>>>>> This works for getting a C string out of the CScalar:
>>>>>>> ```
>>>>>>> name_buffer = (<CBaseBinaryScalar*>GetResultValue(names.get().\
>>>>>>>     GetScalar(batch_row_index)).get()).value
>>>>>>> name = <char *>name_buffer.get().data()
>>>>>>> ```
>>>>>>>
>>>>>>> On Tue, Apr 13, 2021 at 10:43 PM, Xander Dunn <xander@xander.ai> wrote:
>>>>>>>
>>>>>>>> Here is an example code snippet from a .pyx file that successfully
>>>>>>>> iterates through a CRecordBatch and ensures that the timestamps
>>>>>>>> are ascending:
>>>>>>>> ```
>>>>>>>> while batch_row_index < batch.get().num_rows():
>>>>>>>>     timestamp = GetResultValue(times.get().GetScalar(batch_row_index))
>>>>>>>>     new_timestamp = <CTimestampScalar*>timestamp.get()
>>>>>>>>     current_timestamp = timestamps[name]
>>>>>>>>     if current_timestamp > new_timestamp.value:
>>>>>>>>         abort()
>>>>>>>>     batch_row_index += 1
>>>>>>>> ```
>>>>>>>>
>>>>>>>> However, I'm having difficulty operating on the values in a column
>>>>>>>> of string type. Unlike CTimestampScalar, there is no CStringScalar.
>>>>>>>> Although there is a StringScalar type in C++, it isn't defined in
>>>>>>>> the Cython interface. There is a `CStringType` and a `c_string`
>>>>>>>> type.
>>>>>>>> ```
>>>>>>>> while batch_row_index < batch.get().num_rows():
>>>>>>>>     name = GetResultValue(names.get().GetScalar(batch_row_index))
>>>>>>>>     name_string = <CStringType*>name.get()  # This is wrong
>>>>>>>>     printf("%s\n", name_string)  # This prints garbage
>>>>>>>>     if name_string == b"Xander":  # Doesn't work
>>>>>>>>         print("found it")
>>>>>>>>     batch_row_index += 1
>>>>>>>> ```
>>>>>>>> How do I get the string value as a C type and compare it to other
>>>>>>>> strings?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Xander