hive-user mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: bloom filter used in 0.14?
Date Fri, 29 Jan 2016 01:42:51 GMT
> So I am questioning whether it is enabled on the version I am on, which
>is 0.14. Does anyone know?

https://issues.apache.org/jira/browse/HIVE-9188 - fix-version (1.2.0)


The version you are using does not have bloom filter support.

It should be ignoring the parameter and not generating any bloom filter
streams when writing.

hive --orcfiledump (in later versions) will print the BLOOM_FILTER as a
column next to the row index streams.

> Without any optimization, I have to use thousands of mappers to find
>just one id.


Everything else you are doing is appropriate; however, be aware that the
bloom filter index (and the row index) is consulted only *after* a mapper
starts up.

So it might still spin up a mapper, but that mapper might exit immediately,
which plays well with Tez container reuse on very busy clusters - in fact, it
might be faster on a busy cluster than on a completely idle one.

The sorted[1] min-max indicators suggested by Prasanth, however, are
actually rolled up to the split level & can be used to prune splits before
they are scheduled.
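
To illustrate why sorting matters for min-max pruning, here is a toy model
in Python. It is only a sketch, not Hive or ORC code: the helper names
(make_splits, surviving_splits), the split size and the id range are all
made up for the example, and a global sort stands in for what CLUSTER BY
does per reducer.

```python
import random


def make_splits(ids, rows_per_split):
    """Cut ids into consecutive chunks and keep only (min, max) per chunk.

    Hypothetical stand-in for the per-split min-max statistics.
    """
    splits = []
    for start in range(0, len(ids), rows_per_split):
        chunk = ids[start:start + rows_per_split]
        splits.append((min(chunk), max(chunk)))
    return splits


def surviving_splits(splits, wanted_id):
    """Count splits whose [min, max] range could still contain wanted_id."""
    return sum(1 for lo, hi in splits if lo <= wanted_id <= hi)


random.seed(0)
shuffled = list(range(100_000))   # assumed: unique ids, written in random order
random.shuffle(shuffled)

unsorted_splits = make_splits(shuffled, 1_000)
sorted_splits = make_splits(sorted(shuffled), 1_000)

wanted = 54_321
print("splits to schedule, unsorted:", surviving_splits(unsorted_splits, wanted))
print("splits to schedule, sorted:  ", surviving_splits(sorted_splits, wanted))
```

With unsorted data nearly every split has a min-max range wide enough to
cover the wanted id, so almost nothing can be pruned up front. With sorted
data the ranges do not overlap and a single split remains.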

Cheers,
Gopal
[1] - only CLUSTER BY needed, not ORDER BY


