From: John Lilley <john.lilley@redpoint.net>
To: user@hadoop.apache.org
Subject: RE: How to design the mapper and reducer for the following problem
Date: Sun, 16 Jun 2013 19:25:20 +0000

You basically have a "record similarity scoring and linking" problem -- common in data-quality software like ours. It can be thought of as computing the cross-product of all records, counting the number of hash keys each pair has in common, and then outputting the pairs that exceed a threshold. Done naively, this is very slow for large data because the intermediate data set, or at least the number of iterations, is N-squared in the number of records.

If you have assurance that the frequency of any given HASH value is low, such that all records containing a given hash key can fit into memory, it can be done as follows (a rough sketch of the two jobs appears after the list):

1) Mapper1 outputs four tuples per input record, with the hash as the key: {HASH1, DOCID}, {HASH2, DOCID}, {HASH3, DOCID}, {HASH4, DOCID}
2) Reducer1 loads all tuples with the same HASH into memory.
3) Reducer1 outputs all tuples { DOCID1, DOCID2, HASH } that share that hash key (a nested loop, but only outputting pairs where DOCID1 < DOCID2).
4) Mapper2 loads the tuples from Reducer1 and treats { DOCID1, DOCID2 } as the key.
5) Reducer2 counts the instances of each {DOCID1, DOCID2} and outputs the DOCID pairs whose count exceeds the threshold.
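For concreteness, here is a minimal, untested sketch of those two jobs against the org.apache.hadoop.mapreduce API. The class names and the "similarity.threshold" configuration key are made up for illustration, and it assumes whitespace-delimited input lines of the form "DOCID HASH1 HASH2 HASH3 HASH4":

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SharedHashPairs {

  // Job 1, Mapper1: emit one {HASH, DOCID} tuple per hash on the input line.
  public static class HashMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().trim().split("\\s+");
      for (int i = 1; i < f.length; i++) {
        ctx.write(new Text(f[i]), new Text(f[0]));          // key = HASHi, value = DOCID
      }
    }
  }

  // Job 1, Reducer1: buffer every DOCID that carries this hash (assumed to fit
  // in memory), then emit each ordered DOCID pair once for this hash.
  public static class PairReducer extends Reducer<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void reduce(Text hash, Iterable<Text> docs, Context ctx)
        throws IOException, InterruptedException {
      List<String> ids = new ArrayList<String>();
      for (Text d : docs) ids.add(d.toString());            // copy; Hadoop reuses the Text object
      for (int i = 0; i < ids.size(); i++) {
        for (int j = i + 1; j < ids.size(); j++) {
          String a = ids.get(i), b = ids.get(j);
          if (a.compareTo(b) > 0) { String t = a; a = b; b = t; }  // keep DOCID1 < DOCID2
          ctx.write(new Text(a + "\t" + b), ONE);
        }
      }
    }
  }

  // Job 2, Mapper2: re-key Job 1's text output ("DOCID1 <tab> DOCID2 <tab> 1")
  // by the {DOCID1, DOCID2} pair.
  public static class PairCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");
      ctx.write(new Text(f[0] + "\t" + f[1]), ONE);
    }
  }

  // Job 2, Reducer2: count how many hashes each pair shares and keep the pairs
  // that meet the (made-up) "similarity.threshold" setting.
  public static class ThresholdReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int shared = 0;
      for (IntWritable c : counts) shared += c.get();
      int threshold = ctx.getConfiguration().getInt("similarity.threshold", 2);
      if (shared >= threshold) {
        ctx.write(pair, new IntWritable(shared));
      }
    }
  }
}

A driver would run these as two chained Jobs, pointing the second job's input path at the first job's output directory; the final output is one "DOCID1 <tab> DOCID2 <tab> shared-hash-count" line per pair at or above the threshold.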
If you have no such assurance, make Mapper1 a map-only job, and replace Reducer1 with a new job that joins by HASH. Joins are not standardized in MR but can be done with MultipleInputs, and of course Pig has this built in. Searching on "Hadoop join" will give you some ideas of how to implement it in straight MR; a rough skeleton is below.
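If it helps, a bare-bones reduce-side join with MultipleInputs looks roughly like the following (Hadoop 2.x-style API; the class names are made up, and it assumes both inputs are tab-separated "HASH <tab> DOCID" files, e.g. the output of the map-only job above):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HashJoin {

  // Each input gets its own mapper; both key on HASH and tag the value with
  // its side ("L" or "R") so the reducer can tell the two inputs apart.
  public static class LeftMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");             // HASH <tab> DOCID
      ctx.write(new Text(f[0]), new Text("L\t" + f[1]));
    }
  }

  public static class RightMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");
      ctx.write(new Text(f[0]), new Text("R\t" + f[1]));
    }
  }

  // The reducer receives all tagged values for one HASH and pairs the two sides.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text hash, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      List<String> left = new ArrayList<String>();
      List<String> right = new ArrayList<String>();
      for (Text v : vals) {
        String[] f = v.toString().split("\t", 2);
        (f[0].equals("L") ? left : right).add(f[1]);
      }
      for (String l : left)
        for (String r : right)
          ctx.write(new Text(l + "\t" + r), hash);           // DOCID1 <tab> DOCID2 -> HASH
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "join by hash");
    job.setJarByClass(HashJoin.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, LeftMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, RightMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that this naive version still buffers the per-key value lists in the reducer, so by itself it does not escape the memory assumption; the usual fixes are a secondary sort so that one side streams, or letting Pig/Hive pick a join strategy for you.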

John
From: parnab kumar [mailto:parnab.2007@gmail.com]
Sent: Friday, June 14, 2013 8:06 AM
To: user@hadoop.apache.org
Subject: How to design the mapper and reducer for the following problem

An input file where each line corresponds to a document. Each document is identified by some fingerprints. For example, a line in the input file is of the following form:

input:
---------------------
DOCID1   HASH1 HASH2 HASH3 HASH4
DOCID2   HASH5 HASH3 HASH1 HASH4

The output of the mapreduce job should write the pairs of DOCIDs which share a threshold number of HASHes in common.

output:
--------------------------
DOCID1 DOCID2
DOCID3 DOCID5