Subject: Re: Dataset comparison and ranking - views
From: Marcos Ortiz
To: Sonal Goyal
Cc: mapreduce-user@hadoop.apache.org
Date: Mon, 07 Mar 2011 14:55:01 -0430
Message-ID: <1299525901.3597.20.camel@marcosluis-Aspire-5251>

On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
> Hi,
>
> I am working on a problem to compare two different datasets, and rank
> each record of the first with respect to the other, in terms of how
> similar they are. The records are dimensional, but do not have a lot
> of dimensions. Some of the fields will be compared for exact matches,
> some for similar sound, some with closest match etc. One of the
> datasets is large, and the other is much smaller. The final goal is
> to compute a rank between each record of first dataset with each
> record of the second. The rank is based on weighted scores of each
> dimension comparison.
>
> I was wondering if people in the community have any advice/suggested
> patterns/thoughts about cross joining two datasets in map reduce. Do
> let me know if you have any suggestions.
>
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies

Regards, Sonal.
Can you give us more information about the basic workflow of your idea?

Some questions:
- How do you know that two records are identical? By id?
- Can you give an example of the ranking that you want to achieve for
  each of these cases:
  - two records that are identical
  - two records that are similar
  - two records with the closest match

For designing MapReduce algorithms, I recommend this excellent post
from Ricky Ho:
http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html

For the join of the two datasets, you can use Pig. Here is a basic Pig
example from Milind Bhandarkar (milindb@yahoo-inc.com)'s talk
"Practical Problem Solving with Hadoop and Pig":

    Users = load 'users' as (name, age);
    Filtered = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Joined = join Filtered by name, Pages by user;
    Grouped = group Joined by url;
    Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
    Sorted = order Summed by clicks desc;
    Top5 = limit Sorted 5;
    store Top5 into 'top5sites';

-- 
Marcos Luís Ortíz Valmaseda
Software Engineer
Centro de Tecnologías de Gestión de Datos (DATEC)
Universidad de las Ciencias Informáticas
http://uncubanitolinuxero.blogspot.com
http://www.linkedin.com/in/marcosluis2186
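
P.S. Since one dataset is much smaller than the other, one possible
pattern is to keep the small dataset in memory on every mapper and
score each record of the large dataset against it. Below is a rough
sketch of that map step in plain Python (not Hadoop or Pig code). The
field names, the weights, and the use of difflib for the "closest
match" comparison are all assumptions made for illustration; the
thread does not specify any of them, and a "similar sound" comparison
would need its own function.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; the thread does not give any.
WEIGHTS = {"id": 0.5, "name": 0.3, "city": 0.2}

def field_score(field, a, b):
    """Exact match for 'id', fuzzy string similarity for other fields."""
    if field == "id":
        return 1.0 if a == b else 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score(rec_a, rec_b):
    """Weighted sum of the per-field comparison scores."""
    return sum(w * field_score(f, rec_a[f], rec_b[f])
               for f, w in WEIGHTS.items())

def map_record(large_rec, small_dataset):
    """Map step: compare one record of the large dataset with every
    record of the small dataset and return (score, record) pairs,
    best match first."""
    scored = [(score(large_rec, s), s) for s in small_dataset]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Each call to map_record handles one record of the large dataset, so
the work splits across mappers without any reduce-side join; this only
holds up while the small dataset fits in memory.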