hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chase Bradford <chase.bradf...@gmail.com>
Subject Re: Dataset comparison and ranking - views
Date Tue, 08 Mar 2011 17:07:08 GMT
How much smaller is the smaller dataset?  If you can use the DC and precompute bigrams, locations,
etc, and hold all the results in memory during setup before mapping on the large dataset,
then I would suggest that approach.

Another trick I've seen for similar problems where the final score is a product of feature
scores, is to cluster in a way that eliminates obvious 0s.  For example, if distance >
50km is a zero, then choose enough anchor coordinates to canvas the map with circles with
radius 25km and overlap.  Then, your mapper would emit (coord, record) pairs for every anchor
region the record is in.  That way, only records know to be similar in some way are considered.

On Mar 7, 2011, at 9:21 PM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:

> Hi Marcos,
> 
> Thanks for replying. I think I was not very clear in my last post. Let me describe my
use case in detail.
> 
> I have two datasets coming from different sources, lets call them dataset1 and dataset2.
Both of them contain records for entities, say Person. A single record looks like:
> 
> First Name Last Name,  Street, City, State,Zip
> 
> We want to compare each record of dataset1 with each record of dataset2, in effect a
cross join. 
> 
> We know that the way data is collected, names will not match exactly, but we want to
find close enoughs. So we have a rule which says create bigrams and find the matching bigrams.
If 0 to 5 match, give a score of 10, if 5-15 match, give a score of 20 and so on. 
> For Zip, we have our rule saying exact match or within 5 kms of each other(through a
lookup), give a score of 50 and so on.
> 
> Once we have each person of dataset1 compared with that of dataset2, we find the overall
rank. Which is a weighted average of scores of name, address etc comparison. 
> 
> One approach is to use the DistributedCache for the smaller dataset and do a nested loop
join in the mapper. The second approach is to use multiple  MR flows, and compare the fields
and reduce/collate the results. 
> 
> I am curious to know if people have other approaches they have implemented, what are
the efficiencies they have built up etc.
> 
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies 
> 
> 
> 
> 
> 
> 
> 
> On Tue, Mar 8, 2011 at 12:55 AM, Marcos Ortiz <mlortiz@uci.cu> wrote:
> On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
> > Hi,
> >
> > I am working on a problem to compare two different datasets, and rank
> > each record of the first with respect to the other, in terms of how
> > similar they are. The records are dimensional, but do not have a lot
> > of dimensions. Some of the fields will be compared for exact matches,
> > some for similar sound, some with closest match etc. One of the
> > datasets is large, and the other is much smaller.  The final goal is
> > to compute a rank between each record of first dataset with each
> > record of the second. The rank is based on weighted scores of each
> > dimension comparison.
> >
> > I was wondering if people in the community have any advice/suggested
> > patterns/thoughts about cross joining two datasets in map reduce. Do
> > let me know if you have any suggestions.
> >
> > Thanks and Regards,
> > Sonal
> > Hadoop ETL and Data Integration
> > Nube Technologies
> 
> Regards, Sonal. Can you give us more information about a basic workflow
> of your idea?
> 
> Some questions:
> - How do you know that two records are identical? By id?
> - Can you give a example of the ranking that you want to archieve with a
> match of each case:
> - two records that are identical
> - two records that ar similar
> - two records with the closest match
> 
> For MapReduce Design's Algoritms, I recommend to you this excelent from
> Ricky Ho:
> http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html
> 
> For the join of the two datasets, you can use Pig for this. Here you
> have a basic Pig example from Milind Bhandarkar
> (milindb@yahoo-inc.com)'s talk "Practical Problem Solving with Hadoop
> and Pig":
> Users = load ‘users’ as (name, age);
> Filtered = filter Users by age >= 18 and age <= 25;
> Pages = load ‘pages’ as (user, url);
> Joined = join Filtered by name, Pages by user;
> Grouped = group Joined by url;
> Summed = foreach Grouped generate group,
>            COUNT(Joined) as clicks;
> Sorted = order Summed by clicks desc;
> Top5 = limit Sorted 5;
> store Top5 into ‘top5sites’;
> 
> 
> --
>  Marcos Luís Ortíz Valmaseda
>  Software Engineer
>  Centro de Tecnologías de Gestión de Datos (DATEC)
>  Universidad de las Ciencias Informáticas
>  http://uncubanitolinuxero.blogspot.com
>  http://www.linkedin.com/in/marcosluis2186
> 
> 
> 

Mime
View raw message