hadoop-mapreduce-user mailing list archives

From Sonal Goyal <sonalgoy...@gmail.com>
Subject Re: Dataset comparison and ranking - views
Date Tue, 08 Mar 2011 05:21:58 GMT
Hi Marcos,

Thanks for replying. I think I was not very clear in my last post. Let me
describe my use case in detail.

I have two datasets coming from different sources; let's call them dataset1
and dataset2. Both of them contain records for entities, say Person. A
single record looks like:

First Name Last Name, Street, City, State, Zip

We want to compare each record of dataset1 with each record of dataset2, in
effect a cross join.

We know that, because of the way the data is collected, names will not match
exactly, but we want to find close matches. So we have a rule that says:
create bigrams and count the matching ones. If 0 to 5 match, give a score of
10; if 5-15 match, give a score of 20; and so on.
For Zip, our rule says: on an exact match, or if the two are within 5 km of
each other (through a lookup), give a score of 50, and so on.
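
A minimal sketch of the bigram rule for names, in Python for brevity. The
band boundaries are assumptions: the rule above only says "0 to 5" and "5-15",
so here 5 falls in the lower band and anything above 15 gets a hypothetical
score of 30. The function names are illustrative only.

```python
def bigrams(text):
    """Return the set of character bigrams of a name, lowercased, spaces removed."""
    s = "".join(text.lower().split())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def name_score(name_a, name_b):
    """Score two names by how many distinct bigrams they share.

    Assumed bands: 0-5 shared -> 10, 6-15 shared -> 20, more than 15 -> 30.
    """
    shared = len(bigrams(name_a) & bigrams(name_b))
    if shared <= 5:
        return 10
    if shared <= 15:
        return 20
    return 30
```

For example, "Jon Smith" and "John Smith" share six bigrams, so
name_score("Jon Smith", "John Smith") falls in the second band and returns 20.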

Once we have compared each person of dataset1 with each person of dataset2,
we compute the overall rank, which is a weighted average of the scores for
name, address, etc.
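
The weighted average could be sketched as follows. The field names and
weights are made up for illustration; the real weights would come from the
matching rules.

```python
def overall_rank(scores, weights):
    """Weighted average of per-field scores; both dicts are keyed by field name."""
    total_weight = sum(weights[field] for field in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(scores[field] * weights[field] for field in scores) / total_weight
```

With hypothetical scores {"name": 20, "zip": 50} and weights
{"name": 2, "zip": 1}, this gives (20*2 + 50*1) / 3 = 30.0.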

One approach is to use the DistributedCache for the smaller dataset and do a
nested loop join in the mapper. The second approach is to use multiple
MapReduce jobs that compare the fields and then reduce/collate the results.
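
The first approach, reduced to its core logic, might look like the sketch
below. It is plain Python rather than Hadoop code: load_small_dataset stands
in for reading the cached file once per mapper, and map_record stands in for
the map call on one record of the large dataset. Both names, the
comma-separated layout, and the score_fn parameter are assumptions.

```python
import csv


def load_small_dataset(lines):
    """Read the smaller dataset fully into memory, one list of fields per record."""
    return [row for row in csv.reader(lines) if row]


def map_record(large_record, small_records, score_fn):
    """Nested loop join for one input record: pair it with every cached record."""
    for small_record in small_records:
        yield large_record, small_record, score_fn(large_record, small_record)
```

Every mapper emits one output per (large record, small record) pair, so the
output size is the product of the two dataset sizes. This only works while
the smaller dataset fits in each mapper's memory.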

I am curious to know whether people have implemented other approaches, and
what efficiencies they have built in.

Thanks and Regards,
Hadoop ETL and Data Integration <https://github.com/sonalgoyal/hiho>
Nube Technologies <http://www.nubetech.co>


On Tue, Mar 8, 2011 at 12:55 AM, Marcos Ortiz <mlortiz@uci.cu> wrote:

> On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
> > Hi,
> >
> > I am working on a problem to compare two different datasets, and rank
> > each record of the first with respect to the other, in terms of how
> > similar they are. The records are dimensional, but do not have a lot
> > of dimensions. Some of the fields will be compared for exact matches,
> > some for similar sound, some with closest match etc. One of the
> > datasets is large, and the other is much smaller.  The final goal is
> > to compute a rank between each record of first dataset with each
> > record of the second. The rank is based on weighted scores of each
> > dimension comparison.
> >
> > I was wondering if people in the community have any advice/suggested
> > patterns/thoughts about cross joining two datasets in map reduce. Do
> > let me know if you have any suggestions.
> >
> > Thanks and Regards,
> > Sonal
> > Hadoop ETL and Data Integration
> > Nube Technologies
> Regards, Sonal. Can you give us more information about a basic workflow
> of your idea?
> Some questions:
> - How do you know that two records are identical? By id?
> - Can you give an example of the ranking that you want to achieve for
> each of these cases:
> - two records that are identical
> - two records that are similar
> - two records with the closest match
> For MapReduce algorithm design, I recommend this excellent post from
> Ricky Ho:
> http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html
> For the join of the two datasets, you can use Pig for this. Here you
> have a basic Pig example from Milind Bhandarkar
> (milindb@yahoo-inc.com)'s talk "Practical Problem Solving with Hadoop
> and Pig":
> Users = load 'users' as (name, age);
> Filtered = filter Users by age >= 18 and age <= 25;
> Pages = load 'pages' as (user, url);
> Joined = join Filtered by name, Pages by user;
> Grouped = group Joined by url;
> Summed = foreach Grouped generate group,
>            COUNT(Joined) as clicks;
> Sorted = order Summed by clicks desc;
> Top5 = limit Sorted 5;
> store Top5 into 'top5sites';
> --
>  Marcos Luís Ortíz Valmaseda
>  Software Engineer
>  Centro de Tecnologías de Gestión de Datos (DATEC)
>  Universidad de las Ciencias Informáticas
>  http://uncubanitolinuxero.blogspot.com
>  http://www.linkedin.com/in/marcosluis2186
