Mailing-List: contact mapreduce-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-user@hadoop.apache.org
Subject: Re: Dataset comparison and ranking - views
From: Marcos Ortiz <mlortiz@uci.cu>
To: Sonal Goyal <sonalgoyal4@gmail.com>
Cc: mapreduce-user@hadoop.apache.org
In-Reply-To: <AANLkTi=9p1dYOM_Gf5CrnX=Yg+BWOjj+9W_HZkU4cvht@mail.gmail.com>
References: <AANLkTi=9p1dYOM_Gf5CrnX=Yg+BWOjj+9W_HZkU4cvht@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Organization: UCI
Date: Mon, 07 Mar 2011 14:55:01 -0430
Message-ID: <1299525901.3597.20.camel@marcosluis-Aspire-5251>
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit

On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
> Hi,
> 
> I am working on a problem to compare two different datasets, and rank
> each record of the first with respect to the other, in terms of how
> similar they are. The records are dimensional, but do not have a lot
> of dimensions. Some of the fields will be compared for exact matches,
> some for similar sound, some with closest match etc. One of the
> datasets is large, and the other is much smaller.  The final goal is
> to compute a rank between each record of first dataset with each
> record of the second. The rank is based on weighted scores of each
> dimension comparison.
> 
> I was wondering if people in the community have any advice/suggested
> patterns/thoughts about cross joining two datasets in map reduce. Do
> let me know if you have any suggestions.   
> 
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies 

Hi Sonal. Can you give us more information about the basic workflow you
have in mind?

Some questions:
- How do you know that two records are identical? By id?
- Can you give an example of the ranking that you want to achieve for
  each of these cases:
  - two records that are identical
  - two records that are similar
  - two records with the closest match
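
To make the three cases concrete, here is a rough Python sketch of how a
weighted score could be built from per-field comparisons. Everything in
it is an assumption for illustration only: the field names (zip, name,
city), the weights, and the choice of a simplified Soundex for "similar
sound" and difflib for "closest match". Your real fields and weights
will differ.

```python
import difflib

# Simplified Soundex letter codes (illustrative, not a full implementation).
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Return a 4-character Soundex-style code, or "" for an empty word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate equal codes
            prev = digit
    return (code + "000")[:4]


def exact(a, b):
    """1.0 when the values are equal ignoring case and outer spaces."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def sounds_alike(a, b):
    """1.0 when both values share the same non-empty Soundex code."""
    code = soundex(a)
    return 1.0 if code and code == soundex(b) else 0.0


def closeness(a, b):
    """Similarity in [0, 1] based on matching character runs."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical fields, weights and comparators; adjust to the real data.
WEIGHTS = {"zip": 0.5, "name": 0.3, "city": 0.2}
COMPARATORS = {"zip": exact, "name": sounds_alike, "city": closeness}


def score(left, right):
    """Weighted sum of the per-field comparison scores."""
    return sum(WEIGHTS[f] * COMPARATORS[f](left[f], right[f]) for f in WEIGHTS)
```

With a function like score(), ranking one record against the other
dataset is just sorting the candidates by score in descending order.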

For designing MapReduce algorithms, I recommend this excellent post by
Ricky Ho:
http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html

To join the two datasets, you can use Pig. Here is a basic Pig example
from the talk "Practical Problem Solving with Hadoop and Pig" by Milind
Bhandarkar (milindb@yahoo-inc.com):
-- Load users and keep those aged 18 to 25.
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
-- Load page visits and join them to the filtered users.
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
-- Count clicks per URL and keep the five most visited.
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
            COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
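
Since you said one dataset is much smaller, a common pattern for a cross
join is to keep the small dataset in memory on every mapper and stream
the large one past it. Below is a minimal Python sketch of that idea,
not Hadoop code. The names cross_join_top_k and field_overlap and the
sample records are made up for illustration, and field_overlap is only
a stand-in for your real weighted score.

```python
import heapq


def cross_join_top_k(large_records, small_records, score, k=3):
    """For each large record, yield its k best-scoring small records."""
    small = list(small_records)  # small side held fully in memory
    for big in large_records:    # large side streamed one record at a time
        # Score against every small record: this is the cross join.
        scored = ((score(big, s), i) for i, s in enumerate(small))
        best = heapq.nlargest(k, scored)
        yield big, [(sc, small[i]) for sc, i in best]


def field_overlap(a, b):
    """Placeholder score: fraction of shared fields with equal values."""
    keys = a.keys() & b.keys()
    return sum(a[key] == b[key] for key in keys) / len(keys) if keys else 0.0


# Hypothetical sample data.
small = [{"name": "ana", "city": "pune"}, {"name": "raj", "city": "delhi"}]
large = [{"name": "raj", "city": "pune"}, {"name": "ana", "city": "pune"}]

for record, matches in cross_join_top_k(large, small, field_overlap, k=1):
    print(record, matches)
```

Each large record is scored independently of the others, so this step
needs no reduce phase. A reducer is only required if you also want a
global ordering of the results.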


-- 
 Marcos Luís Ortíz Valmaseda
 Software Engineer
 Centro de Tecnologías de Gestión de Datos (DATEC)
 Universidad de las Ciencias Informáticas
 http://uncubanitolinuxero.blogspot.com
 http://www.linkedin.com/in/marcosluis2186