From Marcos Ortiz <mlor...@uci.cu>
Subject Re: Dataset comparison and ranking - views
Date Mon, 07 Mar 2011 19:25:01 GMT
```
On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
> Hi,
>
> I am working on a problem to compare two different datasets, and rank
> each record of the first with respect to the other, in terms of how
> similar they are. The records are dimensional, but do not have a lot
> of dimensions. Some of the fields will be compared for exact matches,
> some for similar sound, some with closest match etc. One of the
> datasets is large, and the other is much smaller.  The final goal is
> to compute a rank between each record of first dataset with each
> record of the second. The rank is based on weighted scores of each
> dimension comparison.
>
> I was wondering if people in the community have any advice/suggested
> patterns/thoughts about cross joining two datasets in map reduce. Do
> let me know if you have any suggestions.
>
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies

Some questions:
- How do you know that two records are identical? By id?
- Can you give an example of the ranking that you want to achieve for
each of these cases:
  - two records that are identical
  - two records that are similar
  - two records with the closest match

For designing MapReduce algorithms, I recommend this excellent post from
Ricky Ho:
http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html

To join the two datasets, you can use Pig. Here is a basic Pig example
from Milind Bhandarkar's (milindb@yahoo-inc.com) talk "Practical Problem
Solving with Hadoop and Pig":
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
    COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
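
-- Sketch only, not from the talk above: one possible way to pair every
-- record of the large dataset with every record of the small one and
-- rank the pairs by a weighted score. The paths ('large', 'small',
-- 'ranked_pairs'), the field names, and the 0.7/0.3 weights are all
-- hypothetical placeholders. Only exact-match comparisons are shown;
-- "similar sound" or "closest match" fields would need a custom UDF.
-- Note that CROSS produces every pair, so it can be very expensive.
Large = load 'large' as (id:chararray, name:chararray, city:chararray);
Small = load 'small' as (id:chararray, name:chararray, city:chararray);
Pairs = cross Large, Small;
Scored = foreach Pairs generate
    Large::id as large_id,
    Small::id as small_id,
    0.7 * (Large::name == Small::name ? 1.0 : 0.0) +
    0.3 * (Large::city == Small::city ? 1.0 : 0.0) as score;
Ranked = order Scored by score desc;
store Ranked into 'ranked_pairs';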

--
Marcos Luís Ortíz Valmaseda
Software Engineer
Centro de Tecnologías de Gestión de Datos (DATEC)