Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Message-ID: <24254622.24981287422603876.JavaMail.jira@thor>
Date: Mon, 18 Oct 2010 13:23:23 -0400 (EDT)
From: "Joydeep Sen Sarma (JIRA)" <jira@apache.org>
To: hive-dev@hadoop.apache.org
Subject: [jira] Commented: (HIVE-1721) use bloom filters to improve the
 performance of map joins
In-Reply-To: <3697658.173151287190475995.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922154#action_12922154 ] 

Joydeep Sen Sarma commented on HIVE-1721:
-----------------------------------------

i am not so sure about this.

consider a hash table which has a very large number of buckets (relative to the size of the elements in the hashtable).  a lookup inside the hashtable stops as soon as we hit an empty bucket. this requires us to only compute the hashcode(). if #buckets >> #elements - then for a miss - the likely average cost of the miss should only be the cost of the hashcode routine.

now consider a bloom filter. here we have to compute multiple hash codes (or at least one). on top of that - with the added bloom filters - there's an added cost for each positive (many hashcode computations). 

It's very clear from this reasoning that Bloom filters would be more expensive, not less, for small table joins. note that the hashtables in java do allow specification of number of buckets - so the strategy outlined here (of deliberately constructing a sparse hash table) is a feasible one.

Stepping back - this makes sense - because Bloom filters are designed for large data sets (or at least data sets that don't easily fit in memory) - not small ones (that fit easily in memory).

---

It would be more interesting to consider Bloom filters to cover join scenarios that cannot be performed with map join. for example - if the small table had 1M keys and map-join is not able to handle that large a hash table - then one can use bloom filters:

- filter (probabilistically) large table against medium sized table by looking up against bloom filter of medium-sized table (map-side bloom filter). (Note - this is not a join - just a filter)
- take filtered output and do sort-merge join against medium sized table (by now the data size should be greatly reduced and the cost of sorting would go down tremendously).

there's lots of literature around this - it's a pretty well known technique. it's quite different from what's proposed in this jira.

> use bloom filters to improve the performance of map joins
> ---------------------------------------------------------
>
>                 Key: HIVE-1721
>                 URL: https://issues.apache.org/jira/browse/HIVE-1721
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Liyin Tang
>
> In case of map-joins, it is likely that the big table will not find many matching rows from the small table.
> Currently, we perform a hash-map lookup for every row in the big table, which can be pretty expensive.
> It might be useful to try out a bloom-filter containing all the elements in the small table.
> Each element from the big table is first searched in the bloom filter, and only in case of a positive match,
> the small table hash table is explored.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.