orc-dev mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Re: Issue bloom filters with orc?
Date Tue, 16 Aug 2016 14:43:54 GMT
Yeah, I will work on a patch with some test cases.  Thanks.

Aaron

On Mon, Aug 15, 2016 at 9:59 PM, Prasanth J <j.prasanth.j@gmail.com> wrote:

> Hi Aaron
>
> Thanks a lot for reporting the issue and providing test case!
>
> I looked at the test case and I think your solution of offsetting
> rootColumn by 1 is correct. It would be good to have this tested with ACID
> as well, since the root column for ACID will be different.
>
> Would you be willing to put up a patch for this issue? I will help with the
> review and commit.
>
> Thanks
> Prasanth
>
> > On Aug 15, 2016, at 1:08 PM, Aaron McCurry <amccurry@gmail.com> wrote:
> >
> > I have been writing some test code that creates a simple orc writer and
> > reader with bloom filters enabled.  The issue I have is that when the
> > SearchArgument matches the first column name provided in the Options
> > searchArgument method (
> > https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197)
> > the bloom filter doesn't seem to get applied.
> >
> > The test program creates an orc file with 2 string columns.  Then it
> > populates the orc file with 1 million records with the same UUID in both
> > columns, but different values for each row.  Then it performs a series of
> > reads on the file and counts the number of batches read and displays the
> > output.
> >
> > Test program:
> > https://gist.github.com/amccurry/a25a9dad1e657da5f4a1d8aec5e49118
> >
> > NOTE: I'm assuming the column names passed to the searchArgument method (
> > https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197)
> > are there to inform the orc reader which indexes it should read to
> > perform the search operations.
> >
> > High Level Output:
> >
> > where a1 == literal
> > colNames : ["a1"] reads 977 batches
> > colNames : ["a1", "a2"] reads 977 batches
> > colNames : ["a2", "a1"] reads 90 batches
> >
> > where a2 == literal
> > colNames : ["a2"] reads 977 batches
> > colNames : ["a1", "a2"] reads 90 batches
> > colNames : ["a2", "a1"] reads 977 batches
> >
> > where a1 == literal AND where a2 == literal
> > colNames : ["a1", "a2"] reads 90 batches
> > colNames : ["a2", "a1"] reads 90 batches
> >
> > where a1 == literal AND where a1 == literal
> > colNames : ["a1"] reads 977 batches
> > colNames : ["a1", "a2"] reads 977 batches
> > colNames : ["a2", "a1"] reads 90 batches
> >
> > where a2 == literal AND where a2 == literal
> > colNames : ["a2"] reads 977 batches
> > colNames : ["a1", "a2"] reads 90 batches
> > colNames : ["a2", "a1"] reads 977 batches
> >
> > Given that every row has the same value in both columns a1 and a2 I would
> > assume that every one of these test runs would yield the same number of
> > batches read, which should be 90.
> >
> > Raw Output:
> > https://gist.github.com/amccurry/962744f35b19bd013ec48c9bcbfb15e4
> >
> > I think the issue comes from the mapSargColumnsToOrcInternalColIdx method,
> > where the rootColumn value is hard coded to '0':
> > https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713
> >
> > The mapSargColumnsToOrcInternalColIdx method checks each provided column
> > against the columns in the orc schema.  During this it calls findColumns (
> > https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L104)
> > where, if the column name matches one of the values in the columnNames
> > array, the sum of the index and rootColumn is returned.
> >
> > Then, when mapSargColumnsToOrcInternalColIdx returns, each value in the
> > filterColumns array is checked to make sure it is greater than '0'.  If
> > the column index is the first column and the rootColumn is '0', then its
> > return value is '0' and the logical column filter gets omitted.
> >
> > I think the rootColumn literal should be '1' instead of '0' (
> > https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713
> > ).
> >
> > Thoughts?
> >
> > Thanks,
> >
> > Aaron
>
>
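
For reference, the off-by-one described in the quoted report can be modelled in a few lines of plain Java. This is only a rough sketch built from the description in the thread, not the actual ORC source: the class name SargColumnSketch and the helper filterApplied are made up for illustration, and findColumn is a simplified stand-in for the real findColumns.

```java
public class SargColumnSketch {

    // Simplified stand-in for findColumns as described in the thread:
    // returns the position of columnName in columnNames plus rootColumn,
    // or -1 when the name is not present.
    static int findColumn(String[] columnNames, String columnName, int rootColumn) {
        for (int i = 0; i < columnNames.length; i++) {
            if (columnName.equals(columnNames[i])) {
                return i + rootColumn;
            }
        }
        return -1;
    }

    // Hypothetical helper modelling the later check described in the thread:
    // a filter column is only kept when its mapped index is greater than 0.
    static boolean filterApplied(String[] columnNames, String sargColumn, int rootColumn) {
        return findColumn(columnNames, sargColumn, rootColumn) > 0;
    }

    public static void main(String[] args) {
        String[] names = {"a1", "a2"};
        // Compare the hard-coded rootColumn of 0 with the proposed value of 1.
        for (int root : new int[] {0, 1}) {
            for (String col : names) {
                System.out.println("rootColumn=" + root
                        + " column=" + col
                        + " index=" + findColumn(names, col, root)
                        + " filterApplied=" + filterApplied(names, col, root));
            }
        }
    }
}
```

In this model, with rootColumn set to 0 the first listed column maps to index 0 and fails the greater-than-0 check, so its filter is dropped; with rootColumn set to 1 it maps to index 1 and is kept.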
