orc-dev mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Issue bloom filters with orc?
Date Mon, 15 Aug 2016 20:08:42 GMT
I have been writing some test code that creates a simple orc writer and
reader with bloom filters enabled.  The issue is that when the
SearchArgument's column matches the first column name passed to the Options
searchArgument method (
https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197)
the bloom filter doesn't seem to get applied.

The test program creates an orc file with 2 string columns.  Then it
populates the orc file with 1 million records, each with the same UUID in
both columns but a different UUID for each row.  Then it performs a series
of reads on the file, counts the number of batches read, and displays the
output.

Test program:
https://gist.github.com/amccurry/a25a9dad1e657da5f4a1d8aec5e49118

NOTE: I'm assuming the column names passed to the searchArgument method (
https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197)
are there to tell the orc reader which indexes it should read to perform
the search operations.

High Level Output:

where a1 == literal
colNames : ["a1"] reads 977 batches
colNames : ["a1", "a2"] reads 977 batches
colNames : ["a2", "a1"] reads 90 batches

where a2 == literal
colNames : ["a2"] reads 977 batches
colNames : ["a1", "a2"] reads 90 batches
colNames : ["a2", "a1"] reads 977 batches

where a1 == literal AND where a2 == literal
colNames : ["a1", "a2"] reads 90 batches
colNames : ["a2", "a1"] reads 90 batches

where a1 == literal AND where a1 == literal
colNames : ["a1"] reads 977 batches
colNames : ["a1", "a2"] reads 977 batches
colNames : ["a2", "a1"] reads 90 batches

where a2 == literal AND where a2 == literal
colNames : ["a2"] reads 977 batches
colNames : ["a1", "a2"] reads 90 batches
colNames : ["a2", "a1"] reads 977 batches

Given that every row has the same value in both columns a1 and a2 I would
assume that every one of these test runs would yield the same number of
batches read, which should be 90.
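
For context on the numbers above: 977 batches is what a full scan of 1 million rows would produce if the reader hands back 1024 rows per batch. The 1024 figure is an assumption here (it is the default VectorizedRowBatch size; the test program may configure something else), and the class and method names below are only for illustration. A quick sketch of the arithmetic:

```java
public class BatchCountSketch {
    // Number of batches needed to return every row, i.e. ceil(rows / batchSize).
    static long batchesForFullScan(long rows, int batchSize) {
        return (rows + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;  // rows written by the test program
        int batchSize = 1024;    // assumed batch size (VectorizedRowBatch default)
        System.out.println("batches for a full scan: " + batchesForFullScan(rows, batchSize));
    }
}
```

If that assumption holds, the runs reporting 977 batches are not skipping any row groups at all.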

Raw Output:
https://gist.github.com/amccurry/962744f35b19bd013ec48c9bcbfb15e4

I think the issue comes from the call to the
mapSargColumnsToOrcInternalColIdx method, where the rootColumn value is
hard-coded to '0':
https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713

The mapSargColumnsToOrcInternalColIdx method checks each provided column
against the columns in the orc schema.  During this it calls findColumns (
https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L104)
where, if the column name matches one of the values in the columnNames
array, the sum of that value's index and rootColumn is returned.

Then, when mapSargColumnsToOrcInternalColIdx returns, the caller checks each
value in the filterColumns array to make sure its value is greater than
'0'.  If the column is the first one in columnNames and the rootColumn is
'0', then its returned index is '0' and the filter for that column gets
omitted.
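
To make that concrete, here is a minimal, self-contained sketch of the logic as I describe it above. It is not the actual ORC source: findColumn is a simplified stand-in for findColumns, and selectSargColumns and columnCount are hypothetical names for the "greater than 0" check and the number of column ids (root struct, a1, a2).

```java
public class SargColumnMappingSketch {
    // Simplified stand-in for findColumns: position in columnNames plus
    // rootColumn, or -1 when the name is not present.
    static int findColumn(String[] columnNames, String columnName, int rootColumn) {
        for (int i = 0; i < columnNames.length; i++) {
            if (columnName.equals(columnNames[i])) {
                return i + rootColumn;
            }
        }
        return -1;
    }

    // Hypothetical helper: marks the column ids whose indexes would be
    // consulted. Ids that are not greater than 0 are skipped.
    static boolean[] selectSargColumns(String[] columnNames, String[] predicateColumns,
                                       int rootColumn, int columnCount) {
        boolean[] sargColumns = new boolean[columnCount];
        for (String predicateColumn : predicateColumns) {
            int id = findColumn(columnNames, predicateColumn, rootColumn);
            if (id > 0) {
                sargColumns[id] = true;
            }
        }
        return sargColumns;
    }

    public static void main(String[] args) {
        int columnCount = 3; // assumed layout: 0 = root struct, 1 = a1, 2 = a2
        String[] predicate = {"a1"};

        // rootColumn 0, a1 listed first: the id is 0, so nothing is selected.
        System.out.println(java.util.Arrays.toString(
            selectSargColumns(new String[] {"a1"}, predicate, 0, columnCount)));

        // rootColumn 0, a1 listed second: the id is 1, so one column is selected.
        System.out.println(java.util.Arrays.toString(
            selectSargColumns(new String[] {"a2", "a1"}, predicate, 0, columnCount)));

        // rootColumn 1, a1 listed first: the id is 1, so one column is selected.
        System.out.println(java.util.Arrays.toString(
            selectSargColumns(new String[] {"a1"}, predicate, 1, columnCount)));
    }
}
```

In this sketch the first call selects no column at all, which would line up with the 977-batch runs, while the other two select column id 1.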

I think the rootColumn literal should be '1' instead of '0' (
https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713
).

Thoughts?

Thanks,

Aaron
