hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ashutosh Chauhan (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-1721) use bloom filters to improve the performance of joins
Date Sat, 07 Apr 2012 18:24:15 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249310#comment-13249310
] 

Ashutosh Chauhan commented on HIVE-1721:
----------------------------------------

@Alex,
Reading the previous comments on jira, this is proposed to work as follows:
* Create a local task and launch it on client machine, building a bloom filter on medium-sized
table. (~200MB)
* Create a Common Join MR job and launch it on cluster. Also, ship the bloom filter built
in previous step to all the mapper nodes (via Distributed Cache).
* In Mapper, look-up key of every row of large table in bloom filter. If it exists, then send
that row to reducer, else filter it out.
* In reducer, do the cross-product of rows of different table for a given key to get your
joined output. 

As outlined above, it will be a win since you will be shuffling much less data from mappers
to reducers. Though assumptions are cost of building bloom filter on client machine is small,
there is huge difference in sizes of two tables and the join key is highly selective. One
or more of these assumptions may be wrong in which case there might be a performance loss.
So, there is a trade-off when to use this.

I don't know if there exists a way to compute bloom filter in distributed fashion. If there
is such a way, then you can do the step 1 through a MR job (instead of locally) and on a much
larger table and then launch second MR job to do step 2 & 3. Again, there will be trade-offs
here. 

                
> use bloom filters to improve the performance of joins
> -----------------------------------------------------
>
>                 Key: HIVE-1721
>                 URL: https://issues.apache.org/jira/browse/HIVE-1721
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>              Labels: gsoc, gsoc2012, optimization
>
> In case of map-joins, it is likely that the big table will not find many matching rows
from the small table.
> Currently, we perform a hash-map lookup for every row in the big table, which can be
pretty expensive.
> It might be useful to try out a bloom-filter containing all the elements in the small
table.
> Each element from the big table is first searched in the bloom filter, and only in case
of a positive match,
> the small table hash table is explored.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message