hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joydeep Sen Sarma (JIRA)" <>
Subject [jira] Commented: (HIVE-1721) use bloom filters to improve the performance of map joins
Date Mon, 18 Oct 2010 17:23:23 GMT


Joydeep Sen Sarma commented on HIVE-1721:

i am not so sure about this.

consider a hash table which has a very large number of buckets (relative to the size of the
elements in the hashtable).  a lookup inside the hashtable stops as soon as we hit an empty
bucket. this requires us to only compute the hashcode(). if #buckets >> #elements -
then for a miss - the likely average cost of the miss should only be the cost of the hashcode

now consider a bloom filter. here we have to compute multiple hash codes (or at least one).
on top of that - with the added bloom filters - there's an added cost for each positive (many
hashcode computations). 

It's very clear from this reasoning that Bloom filters would be more expensive, not less,
for small table joins. note that the hashtables in java do allow specification of number of
buckets - so the strategy outlined here (of deliberately constructing a sparse hash table)
is a feasible one.

Stepping back - this makes sense - because Bloom filters are designed for large data sets
(or at least data sets that don't easily fit in memory) - not small ones (that fit easily
in memory).


It would be more interesting to consider Bloom filters to cover join scenarios that cannot
be performed with map join. for example - if the small table had 1M keys and map-join is not
able to handle that large a hash table - then one can use bloom filters:

- filter (probabilistically) large table against medium sized table by looking up against
bloom filter of medium-sized table (map-side bloom filter). (Note - this is not a join - just
a filter)
- take filtered output and do sort-merge join against medium sized table (by now the data
size should be greatly reduced and the cost of sorting would go down tremendously).

there's lots of literature around this - it's a pretty well known technique. it's quite different
from what's proposed in this jira.

> use bloom filters to improve the performance of map joins
> ---------------------------------------------------------
>                 Key: HIVE-1721
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Liyin Tang
> In case of map-joins, it is likely that the big table will not find many matching rows
from the small table.
> Currently, we perform a hash-map lookup for every row in the big table, which can be
pretty expensive.
> It might be useful to try out a bloom-filter containing all the elements in the small
> Each element from the big table is first searched in the bloom filter, and only in case
of a positive match,
> the small table hash table is explored.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message