arrow-user mailing list archives

From Yue Ni <niyue....@gmail.com>
Subject Re: [C++] Apply Gandiva Filter to a RecordBatch
Date Sun, 05 Apr 2020 01:18:29 GMT
Hi Wes,

Thanks for the reply, but I don't think this is what I am looking for.

It seems to me this `result.to_array()` will only return the array for the
selection vector (
https://github.com/apache/arrow/blob/b07c2626cb3cdd3498b41da9feedf7c8319baa27/python/pyarrow/gandiva.pyx#L130),
but it is not clear to me how I can use the selection vector to filter the
original record batch.

If I understand correctly, in this test case,
`result.to_array().equals(pa.array(range(1000), type=pa.uint32()))`
asserts that the selection vector holds the integer indices in [0, 1000).
What I am looking for instead is an array from the filtered record batch,
which should be an array of floats here. I know I can iterate over the
indices in the selection vector and use them to retrieve each row from the
original record batch columns, but I am not certain this is the right way
to do it. For example, if the original record batch has multiple columns,
do I need to iterate over the selection vector once per column to filter
each of them? Since this is a common task, I would expect there to be an
easy and efficient API for it.

Basically, I am looking for something like:
selection_vector = filter.evaluate(record_batch, pa.default_memory_pool())
filtered_column_arrays_in_record_batch = record_batch.filter(selection_vector)  # what is the API for doing this?
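
To make the intent concrete, here is a minimal pure-Python sketch of the behaviour I have in mind. It does not use the Arrow or Gandiva APIs: the batch is modelled as a plain dict of lists, and `take_rows` is a hypothetical helper, not an existing function. It only shows the manual approach described above, where the selection vector is computed once and then applied to every column.

```python
from typing import Dict, List, Sequence


def take_rows(columns: Dict[str, List], selection: Sequence[int]) -> Dict[str, List]:
    """Keep only the rows whose indices appear in `selection`.

    `columns` stands in for a record batch (column name -> values) and
    `selection` stands in for the selection vector of row indices.
    """
    # The selection vector is reused as-is for every column.
    return {name: [values[i] for i in selection] for name, values in columns.items()}


# Hypothetical two-column batch, for illustration only.
batch = {
    "a": [1.0, 31.0, 46.0, 3.0, 57.0],
    "b": ["v", "w", "x", "y", "z"],
}

# Stand-in for the filter step: indices of the rows where a < 10.
selection = [i for i, value in enumerate(batch["a"]) if value < 10.0]

filtered = take_rows(batch, selection)
```

The filter condition is evaluated only once. The per-column work is a gather by index, which is the step I am hoping an existing API already provides.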

In the C++ test cases, the closest thing I found is constructing a gandiva
projector that uses the selection vector, but every test case there
requires the client to construct a gandiva expression to build the
projector (for example, this test case uses a {sum_expr} to construct the
projector:
https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_project_test.cc#L121
).

I wonder whether the filtering can be done without creating a projection
expression. If a projector is expected to be used for this, which
projector expressions should I use to keep all the columns as they are,
with only some rows filtered out according to the given criteria?

On Sun, Apr 5, 2020 at 7:27 AM Wes McKinney <wesmckinn@gmail.com> wrote:

> You can see an example of filtering via the Python bindings
>
>
> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L89
>
> This creates a gandiva::Filter using gandiva::Filter::Make, which can
> be used to filter a RecordBatch
>
> Is this what you need?
>
> On Fri, Apr 3, 2020 at 7:12 PM Yue Ni <niyue.com@gmail.com> wrote:
> >
> > Hi there,
> >
> > I am using the gandiva C++ library for processing RecordBatch. I would
> > like to know how I can apply gandiva::Filter for a RecordBatch so that I
> > can do some filtering without using the projector.
> >
> > Since I don't find any documentation for it, I read some source code
> > about its usage, and here are the test cases I found about its usage:
> > 1) https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_test.cc
> > 2) https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_project_test.cc
> >
> > From my reading, I find it is possible to get a SelectionVector by using
> > the gandiva::Filter, at the same time, you can use the SelectionVector with
> > the gandiva::Projector to filter RecordBatch when doing projection. My
> > questions are:
> > 1) if I don't want to do any projection but simply filtering, what is
> > the recommended way to do it?
> > 2) I am trying to handle the case like "SELECT * FROM table WHERE blah",
> > is it recommended to apply filtering without projection in this case or is
> > there any alternative approach doing it?
> >
> > Thanks.
> >
> > Regards,
> > Yue
> >
>
