arrow-user mailing list archives

From Yue Ni <niyue....@gmail.com>
Subject Re: [C++] Apply Gandiva Filter to a RecordBatch
Date Sun, 05 Apr 2020 03:44:42 GMT
Thanks so much. This is exactly what I was looking for, and I will give it
a try.

BTW, as a follow-up question: if I want to apply both selection and
projection to a record batch, does anyone know whether there is any
performance difference between these two approaches:
1) use filter to get a selection vector and convert it to an array ==> use
arrow::compute::Take to filter the record batch ==> construct a projector
to do projection for the filtered record batch
2) use filter to get a selection vector ==> construct projector to do
projection and apply the selection vector at the same time

The first approach lets me process a query step by step (selection first,
then projection). The second is more concise, but the two steps are less
cleanly separated. I prefer the first approach since it makes a
filtering-only query easier to handle, but I would like to confirm whether
it degrades performance when both selection and projection are needed.
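
To make the comparison concrete, here is a minimal sketch of the two
pipelines in plain Python. It is only a model: lists stand in for Arrow
arrays, and make_selection_vector, take, and project are hypothetical
helpers written for this illustration, not the Arrow or Gandiva API. It
shows that both routes yield the same values and differ only in whether
an intermediate filtered batch is built.

```python
# Plain-Python model of the two pipelines. Lists stand in for Arrow arrays.
# All helpers here are hypothetical; none of them are Arrow or Gandiva calls.

def num_rows(batch):
    """Number of rows in a batch modelled as {column name: list of values}."""
    return len(next(iter(batch.values())))

def row_at(batch, i):
    """View row i of the batch as a {column name: value} dict."""
    return {name: column[i] for name, column in batch.items()}

def make_selection_vector(batch, predicate):
    """Return the indices of the rows for which predicate(row) is true."""
    return [i for i in range(num_rows(batch)) if predicate(row_at(batch, i))]

def take(batch, indices):
    """Build a new batch that holds only the selected rows of every column."""
    return {name: [column[i] for i in indices] for name, column in batch.items()}

def project(batch, expr, selection=None):
    """Evaluate expr on every row, or only on the rows listed in selection."""
    rows = range(num_rows(batch)) if selection is None else selection
    return [expr(row_at(batch, i)) for i in rows]

def is_even_a(row):
    return row["a"] % 2 == 0

def sum_expr(row):
    return row["a"] + row["b"]

batch = {"a": [1, 2, 3, 4], "b": [10.0, 20.0, 30.0, 40.0]}
selection = make_selection_vector(batch, is_even_a)

# Approach 1: materialize the filtered batch, then project over it.
filtered = take(batch, selection)
result_two_step = project(filtered, sum_expr)

# Approach 2: project over the original batch, restricted to the selection.
result_fused = project(batch, sum_expr, selection)
```

In this model, approach 1 copies the selected rows of every column before
projecting, while approach 2 skips that intermediate batch. Whether the
extra copy is measurable in a real Arrow pipeline presumably depends on
the filter's selectivity and the number of columns, so benchmarking on
representative data is the only reliable answer.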

On Sun, Apr 5, 2020 at 11:08 AM Wes McKinney <wesmckinn@gmail.com> wrote:

> Try the arrow::compute::Take function
>
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/take.h#L121
>
> On Sat, Apr 4, 2020 at 8:19 PM Yue Ni <niyue.com@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > Thanks for the reply, but I don't think this is what I am looking for.
> >
> > It seems to me this `result.to_array()` will only return the array for
> the selection vector (
> https://github.com/apache/arrow/blob/b07c2626cb3cdd3498b41da9feedf7c8319baa27/python/pyarrow/gandiva.pyx#L130),
> but it is not clear to me how I can use the selection vector to filter the
> original record batch.
> >
> > If I understand correctly, in this test case,
> `result.to_array().equals(pa.array(range(1000), type=pa.uint32()))` is
> asserting that the selection vector has integer index values from [0,
> 1000), but I am looking to obtain an array in the filtered record batch
> which should be an array of floats here. I know I can iterate indices in
> the selection vector and use it to retrieve each row in original record
> batch columns, but I am not certain if this is the right way to do it. For
> example, if I have multiple columns in the original record batch, do I need
> to iterate the selection vector multiple times to filter each of the
> columns? Since this is a common task, I expect there is an easy/efficient
> API to do this.
> >
> > Basically, I am looking for something like:
> > selection_vector = filter.evaluate(record_batch,
> pa.default_memory_pool())
> > filtered_column_arrays_in_record_batch =
> record_batch.filter(selection_vector) # what is the API for doing this?
> >
> > In the C++ test cases, the closest thing I find is to construct a
> gandiva projector to use the selection vector, but every test case there
> requires client to construct a gandiva expression to build the projector
> (for example, in this test case, a {sum_expr} is used for constructing the
> projector,
> https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_project_test.cc#L121
> ).
> >
> > I wonder if the filtering can be done without involving creating a
> projection expression. At the same time, if projector is expected to be
> used for doing this, what projector expressions should be used if I want to
> keep all the columns as they are but just with some rows filtered based on
> the criteria given?
> >
> > On Sun, Apr 5, 2020 at 7:27 AM Wes McKinney <wesmckinn@gmail.com> wrote:
> >>
> >> You can see an example of filtering via the Python bindings
> >>
> >>
> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L89
> >>
> >> This creates a gandiva::Filter using gandiva::Filter::Make, which can
> >> be used to filter a RecordBatch
> >>
> >> Is this what you need?
> >>
> >> On Fri, Apr 3, 2020 at 7:12 PM Yue Ni <niyue.com@gmail.com> wrote:
> >> >
> >> > Hi there,
> >> >
> >> > I am using the gandiva C++ library for processing RecordBatch. I
> would like to know how I can apply gandiva::Filter to a RecordBatch so
> that I can do some filtering without using the projector.
> >> >
> >> > Since I couldn't find any documentation for it, I read some source code,
> and here are the test cases I found covering its usage:
> >> > 1)
> https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_test.cc
> >> > 2)
> https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_project_test.cc
> >> >
> >> > From my reading, I find it is possible to get a SelectionVector by
> using the gandiva::Filter, at the same time, you can use the
> SelectionVector with the gandiva::Projector to filter RecordBatch when
> doing projection. My questions are:
> >> > 1) if I don't want to do any projection but simply filtering, what is
> the recommended way to do it?
> >> > 2) I am trying to handle the case like "SELECT * FROM table WHERE
> blah", is it recommended to apply filtering without projection in this case
> or is there any alternative approach doing it?
> >> >
> >> > Thanks.
> >> >
> >> > Regards,
> >> > Yue
> >> >
>
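
For the filter-only case raised in the original question ("SELECT * FROM
table WHERE blah"), the following plain-Python sketch models what applying
a selection vector to a whole batch amounts to. The predicate is evaluated
once to produce the indices, and the same index list is then reused for
every column. The filter_batch helper and the sample columns are
hypothetical and are not part of Arrow or Gandiva. In Arrow itself this
role is played by the Take function suggested in the thread.

```python
# Plain-Python model of a filter-only query. Lists stand in for Arrow arrays,
# and filter_batch is a hypothetical helper, not an Arrow or Gandiva call.

def filter_batch(columns, selection):
    """Apply one selection vector (row indices) to every column."""
    return {
        name: [values[i] for i in selection]
        for name, values in columns.items()
    }

columns = {
    "id": [0, 1, 2, 3, 4],
    "price": [1.5, 2.5, 3.5, 4.5, 5.5],
    "tag": ["a", "b", "c", "d", "e"],
}

# Stand-in for the indices a filter would produce for "price > 3.0".
# The predicate runs once here, not once per column.
selection = [i for i, price in enumerate(columns["price"]) if price > 3.0]

filtered = filter_batch(columns, selection)
```

No projection expression is needed in this model: every column is kept
as it is, and only the rows change.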
