Date: Thu, 16 Aug 2012 08:23:38 +1100 (NCT)
From: "Siddhartha Gunda (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Message-ID: <1405087157.16248.1345065818568.JavaMail.jiratomcat@arcas>
Subject: [jira] [Updated] (HIVE-1721) use bloom filters to improve the performance of joins

     [ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddhartha Gunda updated HIVE-1721:
-----------------------------------

    Attachment: hive-1721.patch.txt

I created some UDF and UDAF functions that can be used to create bloom filters and to use them.

Sample ways to use them:

STEP 1 : CREATE TEMPORARY FUNCTION bloom AS
'org.apache.hadoop.hive.contrib.genericudaf.GenericUDAFBuildBloom';

STEP 2 : CREATE TEMPORARY FUNCTION bloom_filter AS 'org.apache.hadoop.hive.contrib.genericudf.GenericUDFBloomFilter';

STEP 3 : CREATE TABLE 'NameOfBloomFilterTable' as SELECT bloom('HashType', 'NumElements', 'ProbabilityOfFalsePositives',column1,column2,……) FROM 'TableName';

'NameOfBloomFilterTable' - Name of the table in which the bloom filter is stored.
'HashType' - Type of hash function used to build the bloom filter. It accepts two values: 'jenkins' and 'murmur'.
'NumElements' - Number of elements in the table on which the bloom filter is being built.
'ProbabilityOfFalsePositives' - Acceptable probability of false positives.

Example : CREATE TABLE tblBloom as SELECT bloom('jenkins', '20', '0.1',id,str) FROM tblOne;

STEP 4 : ADD FILE 'PathOfBloomFilterTable';

Example : ADD FILE /user/hive/warehouse/tblbloom40/000000_0;

STEP 5 : Sample use cases

SELECT *,bloom_filter('jenkins', '20', '0.1', '000000_0', id, str) FROM Table1;

SELECT * FROM Table1 INNER JOIN Table2 ON Table1.id = Table2.id
WHERE bloom_filter('jenkins', '20', '0.1', '000000_0', Table1.id, Table1.str)

> use bloom filters to improve the performance of joins
> -----------------------------------------------------
>
>                 Key: HIVE-1721
>                 URL: https://issues.apache.org/jira/browse/HIVE-1721
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>              Labels: gsoc, gsoc2012, optimization
>         Attachments: hive-1721.patch.txt
>
>
> In case of map-joins, it is likely that the big table will not find many matching rows from the small table.
> Currently, we perform a hash-map lookup for every row in the big table, which can be pretty expensive.
> It might be useful to try out a bloom-filter containing all the elements in the small table.
> Each element from the big table is first searched in the bloom filter, and only in case of a positive match,
> the small table hash table is explored.
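
For readers who want to see the idea from the issue description in miniature, here is a rough Python sketch of probing a bloom filter before the small-table hash lookup. It is not the code in hive-1721.patch.txt. The names BloomFilter and map_join are hypothetical, the sizing uses the textbook formulas (which may differ from what the patch does), and SHA-256 double hashing stands in for the 'jenkins' and 'murmur' hash types.

```python
import hashlib
import math


class BloomFilter:
    """Toy bloom filter sized from an expected element count and a target
    false-positive probability (hypothetical; not the patch's class)."""

    def __init__(self, num_elements, fp_probability):
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(1, math.ceil(
            -num_elements * math.log(fp_probability) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / num_elements * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Assumption: double hashing over a SHA-256 digest replaces jenkins/murmur.
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def map_join(big_rows, small_rows, num_elements, fp_probability):
    """Inner-join (key, value) rows, consulting the bloom filter first.

    Returns the joined rows and the number of hash-table lookups performed.
    """
    bloom = BloomFilter(num_elements, fp_probability)
    small_table = {}
    for key, value in small_rows:
        bloom.add(key)
        small_table.setdefault(key, []).append(value)

    joined, hash_lookups = [], 0
    for key, value in big_rows:
        if not bloom.might_contain(key):
            continue  # definitely not in the small table: skip the lookup
        hash_lookups += 1
        for small_value in small_table.get(key, []):
            joined.append((key, value, small_value))
    return joined, hash_lookups
```

With the example parameters from STEP 3 ('20' elements, '0.1' false-positive probability), this sizing gives 96 bits and 3 hash functions. A false positive only costs one wasted hash-table lookup, so the join result is the same as without the filter.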