Subject: Re: Dataset comparison and ranking - views
From: Marcos Ortiz
To: Sonal Goyal
Cc: mapreduce-user
Reply-To: mapreduce-user@hadoop.apache.org
Organization: UCI
Date: Tue, 08 Mar 2011 11:26:17 -0430
Message-ID: <1299599777.2214.49.camel@marcosluis-Aspire-5251>
References: <1299525901.3597.20.camel@marcosluis-Aspire-5251>

On Tue, 2011-03-08 at 10:51 +0530, Sonal Goyal wrote:
> Hi Marcos,
>
> Thanks for replying. I think I was not very clear in my last post. Let
> me describe my use case in detail.
>
> I have two datasets coming from different sources; let's call them
> dataset1 and dataset2. Both of them contain records for entities, say
> Person. A single record looks like:
>
> First Name, Last Name, Street, City, State, Zip
>
> We want to compare each record of dataset1 with each record of
> dataset2, in effect a cross join.
>
> We know that, the way the data is collected, names will not match
> exactly, but we want to find close-enough matches. So we have a rule
> which says: create bigrams and find the matching bigrams. If 0 to 5
> match, give a score of 10; if 5 to 15 match, give a score of 20; and
> so on.
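A minimal sketch of the bigram-scoring rule described above, taking the
quoted 0-5 => 10 and 5-15 => 20 thresholds literally; the subroutine
names and the score for larger overlaps are assumptions for
illustration, not part of the thread:

bigram_score.pl
---------------
#!/usr/bin/perl
# Score the similarity of two names by counting shared character
# bigrams, then mapping the match count onto the thresholds quoted
# in the thread.
use strict;
use warnings;

# Split a lowercased string into its character bigrams.
sub bigrams {
    my ($s) = @_;
    $s = lc $s;
    $s =~ s/[^a-z]//g;
    my @b;
    push @b, substr($s, $_, 2) for 0 .. length($s) - 2;
    return @b;
}

# Count bigrams common to both names and map the count to a score.
sub name_score {
    my ($n1, $n2) = @_;
    my %seen = map { $_ => 1 } bigrams($n1);
    my $matches = grep { $seen{$_} } bigrams($n2);
    return 10 if $matches <= 5;
    return 20 if $matches <= 15;
    return 30;    # assumed continuation of the "and so on" rule
}

print name_score("Jonathan Smith", "Jonathon Smyth"), "\n";

The same subroutine can be called per field inside a mapper when the
record pairs are compared.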
Well, one approach to this problem is given by Milind Bhandarkar in his
presentation "Practical Problem Solving with Hadoop and Pig". He talks
about a solution for bigrams, using word matching as an example.

Bigrams
=======
• Input: a large text corpus
• Output: List(word1, TopK(word2))
• Two stages:
  • Generate all possible bigrams
  • Find the most frequent K bigrams for each word

Bigrams: Stage 1 Map
====================
• Generate all possible bigrams
• Map input: large text corpus
• Map computation:
  • In each sentence, for each "word1 word2" pair, output
    (word1, word2) and (word2, word1)
• Partition & sort by (word1, word2)

pairs.pl
--------
#!/usr/bin/perl
# Emit every adjacent word pair, in both orders, one per line.
while (<STDIN>) {
    chomp;
    $_ =~ s/[^a-zA-Z]+/ /g;
    $_ =~ s/^\s+//g;
    $_ =~ s/\s+$//g;
    $_ =~ tr/A-Z/a-z/;
    my @words = split(/\s+/, $_);
    # Pairs (i, i+1) exist for every i up to the last index minus one.
    for (my $i = 0; $i < $#words; ++$i) {
        print "$words[$i]:$words[$i+1]\n";
        print "$words[$i+1]:$words[$i]\n";
    }
}

Bigrams: Stage 1 Reduce
=======================
• Input: List(word1, word2), sorted and partitioned
• Output: List(word1, [freq, word2])
• Counting, similar to the unigrams example

count.pl
--------
#!/usr/bin/perl
# Count runs of identical pairs on the sorted stream and emit
# "word1:count:word2" for each distinct pair.
$_ = <STDIN>;
chomp;
my ($pw1, $pw2) = split(/:/, $_);
$count = 1;
while (<STDIN>) {
    chomp;
    my ($w1, $w2) = split(/:/, $_);
    if ($w1 eq $pw1 && $w2 eq $pw2) {
        $count++;
    } else {
        print "$pw1:$count:$pw2\n";
        $pw1 = $w1;
        $pw2 = $w2;
        $count = 1;
    }
}
print "$pw1:$count:$pw2\n";

Bigrams: Stage 2 Map
====================
• Input: List(word1, [freq, word2])
• Output: List(word1, [freq, word2])
• Identity mapper (/bin/cat)
• Partition by word1
• Sort descending by (word1, freq)

Bigrams: Stage 2 Reduce
=======================
• Input: List(word1, [freq, word2]), partitioned by word1 and sorted
  descending by (word1, freq)
• Output: TopK(List(word1, [freq, word2]))
• For each word, throw away everything after K records

firstN.pl
---------
#!/usr/bin/perl
# Keep only the first N (here 5) highest-frequency partners per word.
$N = 5;
$_ = <STDIN>;
chomp;
my ($pw1, $count, $pw2) = split(/:/, $_);
$idx = 1;
$out = "$pw1\t$pw2,$count;";
while (<STDIN>) {
    chomp;
    my ($w1, $c, $w2) = split(/:/, $_);
    if ($w1 eq $pw1) {
        if ($idx < $N) {
            $out .= "$w2,$c;";
            $idx++;
        }
    } else {
        print "$out\n";
        $pw1 = $w1;
        $idx = 1;
        $out = "$pw1\t$w2,$c;";
    }
}
print "$out\n";

You can translate this approach to your specific problem. I recommend
that you discuss it with him, since he has vast experience with all of
this, much more than me.

Regards

> For Zip, we have a rule saying: on an exact match, or if the two are
> within 5 km of each other (through a lookup), give a score of 50, and
> so on.
>
> Once we have each person of dataset1 compared with that of dataset2,
> we find the overall rank, which is a weighted average of the scores of
> the name, address, etc. comparisons.
>
> One approach is to use the DistributedCache for the smaller dataset
> and do a nested loop join in the mapper. The second approach is to use
> multiple MR flows, and compare the fields and reduce/collate the
> results.
>
> I am curious to know if people have other approaches they have
> implemented, and what efficiencies they have built up.
>
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies
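A minimal Hadoop Streaming sketch of the map-side nested-loop join
mentioned above, assuming the smaller dataset reaches each node through
the DistributedCache as a local file named dataset2.csv; the file name,
the field layout, and the score() placeholder are assumptions:

crossjoin_mapper.pl
-------------------
#!/usr/bin/perl
# For every record of the large dataset on stdin, compare against every
# cached record of the small dataset and emit a scored pair.
use strict;
use warnings;

# Load the small dataset once per mapper; fields are assumed to be
# first,last,street,city,state,zip.
my @small;
open(my $fh, '<', 'dataset2.csv') or die "cannot open cache file: $!";
while (<$fh>) {
    chomp;
    push @small, [ split /,/ ];
}
close $fh;

# Placeholder for the real weighted comparison (bigrams, zip lookup,
# etc.); here only the zip fields (index 5) are compared.
sub score {
    my ($r1, $r2) = @_;
    return ($r1->[5] eq $r2->[5]) ? 50 : 0;
}

while (my $line = <STDIN>) {
    chomp $line;
    my @rec = split /,/, $line;
    for my $s (@small) {
        printf "%s %s\t%s %s\t%d\n",
            $rec[0], $rec[1], $s->[0], $s->[1], score(\@rec, $s);
    }
}

This only works when dataset2 fits comfortably in a mapper's memory;
otherwise the multiple-MR-flow approach is the safer choice.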
> On Tue, Mar 8, 2011 at 12:55 AM, Marcos Ortiz wrote:
>
> On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
> > Hi,
> >
> > I am working on a problem to compare two different datasets, and
> > rank each record of the first with respect to the other, in terms of
> > how similar they are. The records are dimensional, but do not have a
> > lot of dimensions. Some of the fields will be compared for exact
> > matches, some for similar sound, some with closest match, etc. One
> > of the datasets is large, and the other is much smaller. The final
> > goal is to compute a rank between each record of the first dataset
> > and each record of the second. The rank is based on weighted scores
> > of each dimension comparison.
> >
> > I was wondering if people in the community have any advice/suggested
> > patterns/thoughts about cross joining two datasets in MapReduce. Do
> > let me know if you have any suggestions.
> >
> > Thanks and Regards,
> > Sonal
> > Hadoop ETL and Data Integration
> > Nube Technologies
>
> Regards, Sonal. Can you give us more information about the basic
> workflow of your idea?
>
> Some questions:
> - How do you know that two records are identical? By id?
> - Can you give an example of the ranking that you want to achieve in
>   each case:
>   - two records that are identical
>   - two records that are similar
>   - two records with the closest match
>
> For MapReduce algorithm design, I recommend this excellent article
> from Ricky Ho:
> http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html
>
> For the join of the two datasets, you can use Pig. Here you have a
> basic Pig example from Milind Bhandarkar (milindb@yahoo-inc.com)'s
> talk "Practical Problem Solving with Hadoop and Pig":
>
> Users = load 'users' as (name, age);
> Filtered = filter Users by age >= 18 and age <= 25;
> Pages = load 'pages' as (user, url);
> Joined = join Filtered by name, Pages by user;
> Grouped = group Joined by url;
> Summed = foreach Grouped generate group,
>          COUNT(Joined) as clicks;
> Sorted = order Summed by clicks desc;
> Top5 = limit Sorted 5;
> store Top5 into 'top5sites';
>
> --
> Marcos Luís Ortíz Valmaseda
>  Software Engineer
>  Centro de Tecnologías de Gestión de Datos (DATEC)
>  Universidad de las Ciencias Informáticas
>  http://uncubanitolinuxero.blogspot.com
>  http://www.linkedin.com/in/marcosluis2186

--
Marcos Luís Ortíz Valmaseda
 Software Engineer
 Centro de Tecnologías de Gestión de Datos (DATEC)
 Universidad de las Ciencias Informáticas
 http://uncubanitolinuxero.blogspot.com
 http://www.linkedin.com/in/marcosluis2186
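For the overall rank discussed in the thread, a minimal sketch of the
weighted average of per-field scores; the weights and field names are
invented for illustration and are not from the thread:

weighted_rank.pl
----------------
#!/usr/bin/perl
# Combine per-field comparison scores (e.g. name, zip) into a single
# weighted-average rank, as Sonal describes. Weights are assumptions.
use strict;
use warnings;

my %weight = ( name => 0.5, zip => 0.3, street => 0.2 );

# Weighted average over whichever field scores are supplied.
sub overall_rank {
    my (%scores) = @_;
    my ($sum, $wsum) = (0, 0);
    for my $field (keys %scores) {
        my $w = $weight{$field} || 0;
        $sum  += $w * $scores{$field};
        $wsum += $w;
    }
    return $wsum ? $sum / $wsum : 0;
}

# Example: name score 20 (bigram rule), zip score 50 (exact match).
print overall_rank(name => 20, zip => 50, street => 10), "\n";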