Subject: Re: Dataset comparison and ranking - views
From: Marcos Ortiz
To: Sonal Goyal
Cc: mapreduce-user
Reply-To: mapreduce-user@hadoop.apache.org
Organization: UCI
Date: Tue, 08 Mar 2011 11:26:17 -0430
Message-ID: <1299599777.2214.49.camel@marcosluis-Aspire-5251>
References: <1299525901.3597.20.camel@marcosluis-Aspire-5251>

On Tue, 2011-03-08 at 10:51 +0530, Sonal Goyal wrote:
> Hi Marcos,
>
> Thanks for replying. I think I was not very clear in my last post. Let
> me describe my use case in detail.
>
> I have two datasets coming from different sources; let's call them
> dataset1 and dataset2. Both of them contain records for entities, say
> Person. A single record looks like:
>
> First Name, Last Name, Street, City, State, Zip
>
> We want to compare each record of dataset1 with each record of
> dataset2, in effect a cross join.
>
> We know that, the way the data is collected, names will not match
> exactly, but we want to find close-enough matches. So we have a rule
> which says: create bigrams and find the matching bigrams. If 0 to 5
> match, give a score of 10; if 5 to 15 match, give a score of 20; and
> so on.
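A minimal sketch of the bigram-scoring rule described above, taking the
quoted 0-5 => 10 and 5-15 => 20 thresholds literally; the subroutine
names and the score for larger overlaps are assumptions for
illustration, not part of the thread:

bigram_score.pl
---------------
#!/usr/bin/perl
# Score the similarity of two names by counting shared character
# bigrams, then mapping the match count onto the thresholds quoted
# in the thread.
use strict;
use warnings;

# Split a lowercased string into its character bigrams.
sub bigrams {
    my ($s) = @_;
    $s = lc $s;
    $s =~ s/[^a-z]//g;
    my @b;
    push @b, substr($s, $_, 2) for 0 .. length($s) - 2;
    return @b;
}

# Count bigrams common to both names and map the count to a score.
sub name_score {
    my ($n1, $n2) = @_;
    my %seen = map { $_ => 1 } bigrams($n1);
    my $matches = grep { $seen{$_} } bigrams($n2);
    return 10 if $matches <= 5;
    return 20 if $matches <= 15;
    return 30;    # assumed continuation of the "and so on" rule
}

print name_score("Jonathan Smith", "Jonathon Smyth"), "\n";

The same subroutine can be called per field inside a mapper when the
record pairs are compared.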
Well, one approach to this problem is given by Milind Bhandarkar in his
presentation "Practical Problem Solving with Hadoop and Pig". He talks
about a solution for bigrams, using word matching as an example.

Bigrams
=======
• Input: a large text corpus
• Output: List(word1, TopK(word2))
• Two stages:
  • Generate all possible bigrams
  • Find the most frequent K bigrams for each word

Bigrams: Stage 1 Map
====================
• Generate all possible bigrams
• Map input: large text corpus
• Map computation:
  • In each sentence, for each "word1 word2" pair, output
    (word1, word2) and (word2, word1)
• Partition & sort by (word1, word2)

pairs.pl
--------
#!/usr/bin/perl
# Emit every adjacent word pair, in both orders, one per line.
while (<STDIN>) {
    chomp;
    $_ =~ s/[^a-zA-Z]+/ /g;
    $_ =~ s/^\s+//g;
    $_ =~ s/\s+$//g;
    $_ =~ tr/A-Z/a-z/;
    my @words = split(/\s+/, $_);
    # Pairs (i, i+1) exist for every i up to the last index minus one.
    for (my $i = 0; $i < $#words; ++$i) {
        print "$words[$i]:$words[$i+1]\n";
        print "$words[$i+1]:$words[$i]\n";
    }
}

Bigrams: Stage 1 Reduce
=======================
• Input: List(word1, word2), sorted and partitioned
• Output: List(word1, [freq, word2])
• Counting, similar to the unigrams example

count.pl
--------
#!/usr/bin/perl
# Count runs of identical pairs on the sorted stream and emit
# "word1:count:word2" for each distinct pair.
$_ = <STDIN>;
chomp;
my ($pw1, $pw2) = split(/:/, $_);
$count = 1;
while (<STDIN>) {
    chomp;
    my ($w1, $w2) = split(/:/, $_);
    if ($w1 eq $pw1 && $w2 eq $pw2) {
        $count++;
    } else {
        print "$pw1:$count:$pw2\n";
        $pw1 = $w1;
        $pw2 = $w2;
        $count = 1;
    }
}
print "$pw1:$count:$pw2\n";

Bigrams: Stage 2 Map
====================
• Input: List(word1, [freq, word2])
• Output: List(word1, [freq, word2])
• Identity mapper (/bin/cat)
• Partition by word1
• Sort descending by (word1, freq)

Bigrams: Stage 2 Reduce
=======================
• Input: List(word1, [freq, word2]), partitioned by word1 and sorted
  descending by (word1, freq)
• Output: TopK(List(word1, [freq, word2]))
• For each word, throw away everything after K records

firstN.pl
---------
#!/usr/bin/perl
# Keep only the first N (here 5) highest-frequency partners per word.
$N = 5;
$_ = <STDIN>;
chomp;
my ($pw1, $count, $pw2) = split(/:/, $_);
$idx = 1;
$out = "$pw1\t$pw2,$count;";
while (<STDIN>) {
    chomp;
    my ($w1, $c, $w2) = split(/:/, $_);
    if ($w1 eq $pw1) {
        if ($idx < $N) {
            $out .= "$w2,$c;";
            $idx++;
        }
    } else {
        print "$out\n";
        $pw1 = $w1;
        $idx = 1;
        $out = "$pw1\t$w2,$c;";
    }
}
print "$out\n";

You can translate this approach to your specific problem. I recommend
that you discuss it with him, since he has vast experience with all of
this, much more than me.

Regards

> For Zip, we have a rule saying: on an exact match, or if the two are
> within 5 km of each other (through a lookup), give a score of 50, and
> so on.
>
> Once we have each person of dataset1 compared with that of dataset2,
> we find the overall rank, which is a weighted average of the scores of
> the name, address, etc. comparisons.
>
> One approach is to use the DistributedCache for the smaller dataset
> and do a nested loop join in the mapper. The second approach is to use
> multiple MR flows, and compare the fields and reduce/collate the
> results.
>
> I am curious to know if people have other approaches they have
> implemented, and what efficiencies they have built up.
>
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies
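A minimal Hadoop Streaming sketch of the map-side nested-loop join
mentioned above, assuming the smaller dataset reaches each node through
the DistributedCache as a local file named dataset2.csv; the file name,
the field layout, and the score() placeholder are assumptions:

crossjoin_mapper.pl
-------------------
#!/usr/bin/perl
# For every record of the large dataset on stdin, compare against every
# cached record of the small dataset and emit a scored pair.
use strict;
use warnings;

# Load the small dataset once per mapper; fields are assumed to be
# first,last,street,city,state,zip.
my @small;
open(my $fh, '<', 'dataset2.csv') or die "cannot open cache file: $!";
while (<$fh>) {
    chomp;
    push @small, [ split /,/ ];
}
close $fh;

# Placeholder for the real weighted comparison (bigrams, zip lookup,
# etc.); here only the zip fields (index 5) are compared.
sub score {
    my ($r1, $r2) = @_;
    return ($r1->[5] eq $r2->[5]) ? 50 : 0;
}

while (my $line = <STDIN>) {
    chomp $line;
    my @rec = split /,/, $line;
    for my $s (@small) {
        printf "%s %s\t%s %s\t%d\n",
            $rec[0], $rec[1], $s->[0], $s->[1], score(\@rec, $s);
    }
}

This only works when dataset2 fits comfortably in a mapper's memory;
otherwise the multiple-MR-flow approach is the safer choice.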
> On Tue, Mar 8, 2011 at 12:55 AM, Marcos Ortiz wrote:
>
> On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
> > Hi,
> >
> > I am working on a problem to compare two different datasets, and
> > rank each record of the first with respect to the other, in terms of
> > how similar they are. The records are dimensional, but do not have a
> > lot of dimensions. Some of the fields will be compared for exact
> > matches, some for similar sound, some with closest match, etc. One
> > of the datasets is large, and the other is much smaller. The final
> > goal is to compute a rank between each record of the first dataset
> > and each record of the second. The rank is based on weighted scores
> > of each dimension comparison.
> >
> > I was wondering if people in the community have any advice/suggested
> > patterns/thoughts about cross joining two datasets in MapReduce. Do
> > let me know if you have any suggestions.
> >
> > Thanks and Regards,
> > Sonal
> > Hadoop ETL and Data Integration
> > Nube Technologies
>
> Regards, Sonal. Can you give us more information about the basic
> workflow of your idea?
>
> Some questions:
> - How do you know that two records are identical? By id?
> - Can you give an example of the ranking that you want to achieve in
>   each case:
>   - two records that are identical
>   - two records that are similar
>   - two records with the closest match
>
> For MapReduce algorithm design, I recommend this excellent article
> from Ricky Ho:
> http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html
>
> For the join of the two datasets, you can use Pig. Here you have a
> basic Pig example from Milind Bhandarkar (milindb@yahoo-inc.com)'s
> talk "Practical Problem Solving with Hadoop and Pig":
>
> Users = load 'users' as (name, age);
> Filtered = filter Users by age >= 18 and age <= 25;
> Pages = load 'pages' as (user, url);
> Joined = join Filtered by name, Pages by user;
> Grouped = group Joined by url;
> Summed = foreach Grouped generate group,
>          COUNT(Joined) as clicks;
> Sorted = order Summed by clicks desc;
> Top5 = limit Sorted 5;
> store Top5 into 'top5sites';
>
> --
> Marcos Luís Ortíz Valmaseda
>  Software Engineer
>  Centro de Tecnologías de Gestión de Datos (DATEC)
>  Universidad de las Ciencias Informáticas
>  http://uncubanitolinuxero.blogspot.com
>  http://www.linkedin.com/in/marcosluis2186

--
Marcos Luís Ortíz Valmaseda
 Software Engineer
 Centro de Tecnologías de Gestión de Datos (DATEC)
 Universidad de las Ciencias Informáticas
 http://uncubanitolinuxero.blogspot.com
 http://www.linkedin.com/in/marcosluis2186
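For the overall rank discussed in the thread, a minimal sketch of the
weighted average of per-field scores; the weights and field names are
invented for illustration and are not from the thread:

weighted_rank.pl
----------------
#!/usr/bin/perl
# Combine per-field comparison scores (e.g. name, zip) into a single
# weighted-average rank, as Sonal describes. Weights are assumptions.
use strict;
use warnings;

my %weight = ( name => 0.5, zip => 0.3, street => 0.2 );

# Weighted average over whichever field scores are supplied.
sub overall_rank {
    my (%scores) = @_;
    my ($sum, $wsum) = (0, 0);
    for my $field (keys %scores) {
        my $w = $weight{$field} || 0;
        $sum  += $w * $scores{$field};
        $wsum += $w;
    }
    return $wsum ? $sum / $wsum : 0;
}

# Example: name score 20 (bigram rule), zip score 50 (exact match).
print overall_rank(name => 20, zip => 50, street => 10), "\n";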