hadoop-mapreduce-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Dataset comparison and ranking - views
Date Thu, 10 Mar 2011 06:38:44 GMT
The Mahout project has several tools for this class of problem.
http://mahout.apache.org

On Tue, Mar 8, 2011 at 9:07 AM, Chase Bradford <chase.bradford@gmail.com> wrote:
> How much smaller is the smaller dataset?  If you can use the
> DistributedCache (DC) to precompute bigrams, locations, etc., and hold all
> the results in memory during setup before mapping over the large dataset,
> then I would suggest that approach.
> Another trick I've seen for similar problems, where the final score is a
> product of feature scores, is to cluster in a way that eliminates obvious
> 0s.  For example, if distance > 50km is a zero, then choose enough anchor
> coordinates to cover the map with overlapping circles of radius 25km.
> Then, your mapper would emit (coord, record) pairs for every anchor region
> the record is in.  That way, only records known to be similar in some way are
> considered.
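
A minimal sketch of the anchor-region idea above, in plain Python rather than Hadoop's Java API. The record layout (a dict with a "coord" latitude/longitude pair), the function names, and the use of the haversine formula for distance are all assumptions made for illustration; none of them come from the thread.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def map_record(record, anchors, radius_km=25.0):
    """Mapper step: yield (anchor, record) for every anchor circle that
    contains the record. A record in an overlap is emitted more than once."""
    for anchor in anchors:
        if haversine_km(anchor, record["coord"]) <= radius_km:
            yield anchor, record


def group_by_anchor(records, anchors):
    """Stand-in for the shuffle: collect the records emitted for each anchor.
    Only records that share an anchor would then be compared pairwise."""
    groups = defaultdict(list)
    for rec in records:
        for anchor, emitted in map_record(rec, anchors):
            groups[anchor].append(emitted)
    return groups
```

Two records in the same 25km circle are at most 50km apart, so every candidate pair passes the distance cutoff. Whether every pair within 50km ends up sharing a circle depends on how densely the anchors are placed.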
> On Mar 7, 2011, at 9:21 PM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:
>
> Hi Marcos,
>
> Thanks for replying. I think I was not very clear in my last post. Let me
> describe my use case in detail.
>
> I have two datasets coming from different sources, let's call them dataset1
> and dataset2. Both of them contain records for entities, say Person. A
> single record looks like:
>
> First Name Last Name,  Street, City, State,Zip
>
> We want to compare each record of dataset1 with each record of dataset2, in
> effect a cross join.
>
> We know that, given the way the data is collected, names will not match
> exactly, but we want to find close-enough matches. So we have a rule which
> says create bigrams and find the matching bigrams. If 0 to 5 match, give a
> score of 10, if 5-15 match, give a score of 20 and so on.
> For Zip, we have a rule saying exact match or within 5 km of each
> other (through a lookup), give a score of 50 and so on.
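
The two rules above might be sketched as follows, in plain Python. Several details are assumptions, since the message leaves them open: bigrams are taken over lowercased characters with whitespace removed, the overlapping tiers are read as 0-5 -> 10 and 6-15 -> 20, the tier above 15 and the score for a non-matching zip are invented placeholders for "and so on", and the distance lookup is a hypothetical dict.

```python
def bigrams(text):
    """Set of character bigrams of a lowercased string, whitespace removed
    (assumed tokenization; the thread does not specify one)."""
    s = "".join(text.lower().split())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def name_score(name_a, name_b):
    """Score two names by the number of bigrams they share."""
    shared = len(bigrams(name_a) & bigrams(name_b))
    if shared <= 5:
        return 10
    if shared <= 15:
        return 20
    return 30  # assumed next tier for "and so on"


def zip_score(zip_a, zip_b, distance_km):
    """Score two zip codes. distance_km is a hypothetical lookup keyed by
    frozenset({zip_a, zip_b}) and giving the distance in km."""
    if zip_a == zip_b:
        return 50
    d = distance_km.get(frozenset((zip_a, zip_b)))
    if d is not None and d <= 5:
        return 50
    return 0  # assumed; the thread does not say what a miss scores
```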
>
> Once we have each person of dataset1 compared with that of dataset2, we find
> the overall rank, which is a weighted average of the scores from the name,
> address, etc. comparisons.
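
As a small illustration of the weighted average just described, here is a plain-Python sketch. The field names and weights are made up for the example; the thread does not give any.

```python
def overall_rank(scores, weights):
    """Weighted average of per-field scores.

    scores and weights are dicts keyed by field name; every field in
    scores must have a weight.
    """
    total_weight = sum(weights[field] for field in scores)
    weighted_sum = sum(scores[field] * weights[field] for field in scores)
    return weighted_sum / total_weight


# Hypothetical example: the name comparison counts twice as much as the zip.
rank = overall_rank({"name": 20, "zip": 50}, {"name": 2, "zip": 1})
```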
>
> One approach is to use the DistributedCache for the smaller dataset and do a
> nested loop join in the mapper. The second approach is to use multiple MR
> flows, compare the fields, and reduce/collate the results.
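
The first approach might look roughly like this. It is a plain-Python sketch that only mirrors the setup/map shape of a Hadoop mapper; it does not use the Hadoop or DistributedCache API. The class name, the record layout, and the choice to emit the shared-bigram count are all assumptions for illustration.

```python
def bigrams(text):
    """Set of character bigrams of a lowercased string, whitespace removed."""
    s = "".join(text.lower().split())
    return {s[i:i + 2] for i in range(len(s) - 1)}


class CrossJoinMapper:
    """Holds the smaller dataset in memory and compares every record of the
    larger dataset against it (a nested loop join on the map side)."""

    def setup(self, small_records):
        # Runs once per task: precompute bigrams for the small dataset so
        # they are not rebuilt for every record of the large dataset.
        self.small = [(rec, bigrams(rec["name"])) for rec in small_records]

    def map(self, large_record):
        # Runs once per record of the large dataset: the inner loop of the join.
        large_bigrams = bigrams(large_record["name"])
        for small_rec, small_bigrams in self.small:
            shared = len(large_bigrams & small_bigrams)
            yield (large_record["id"], small_rec["id"]), shared
```

Each map call does work proportional to the size of the smaller dataset, so this only makes sense when that dataset fits comfortably in the memory of a single task.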
>
> I am curious to know if people have other approaches they have implemented,
> what are the efficiencies they have built up etc.
>
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies
>
> On Tue, Mar 8, 2011 at 12:55 AM, Marcos Ortiz <mlortiz@uci.cu> wrote:
>>
>> On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
>> > Hi,
>> >
>> > I am working on a problem to compare two different datasets, and rank
>> > each record of the first with respect to the other, in terms of how
>> > similar they are. The records are dimensional, but do not have a lot
>> > of dimensions. Some of the fields will be compared for exact matches,
>> > some for similar sound, some with closest match etc. One of the
>> > datasets is large, and the other is much smaller.  The final goal is
>> > to compute a rank between each record of first dataset with each
>> > record of the second. The rank is based on weighted scores of each
>> > dimension comparison.
>> >
>> > I was wondering if people in the community have any advice/suggested
>> > patterns/thoughts about cross joining two datasets in map reduce. Do
>> > let me know if you have any suggestions.
>> >
>> > Thanks and Regards,
>> > Sonal
>> > Hadoop ETL and Data Integration
>> > Nube Technologies
>>
>> Regards, Sonal. Can you give us more information about a basic workflow
>> of your idea?
>>
>> Some questions:
>> - How do you know that two records are identical? By id?
>> - Can you give an example of the ranking that you want to achieve for
>>   each of these cases:
>>   - two records that are identical
>>   - two records that are similar
>>   - two records with the closest match
>>
>> For designing MapReduce algorithms, I recommend this excellent post from
>> Ricky Ho:
>>
>> http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html
>>
>> For the join of the two datasets, you can use Pig for this. Here you
>> have a basic Pig example from Milind Bhandarkar
>> (milindb@yahoo-inc.com)'s talk "Practical Problem Solving with Hadoop
>> and Pig":
>> -- Load users and keep those aged 18 to 25
>> Users = load 'users' as (name, age);
>> Filtered = filter Users by age >= 18 and age <= 25;
>> Pages = load 'pages' as (user, url);
>> -- Join page views to the filtered users, then count clicks per URL
>> Joined = join Filtered by name, Pages by user;
>> Grouped = group Joined by url;
>> Summed = foreach Grouped generate group,
>>            COUNT(Joined) as clicks;
>> -- Keep the five most-clicked URLs
>> Sorted = order Summed by clicks desc;
>> Top5 = limit Sorted 5;
>> store Top5 into 'top5sites';
>>
>>
>> --
>>  Marcos Luís Ortíz Valmaseda
>>  Software Engineer
>>  Centro de Tecnologías de Gestión de Datos (DATEC)
>>  Universidad de las Ciencias Informáticas
>>  http://uncubanitolinuxero.blogspot.com
>>  http://www.linkedin.com/in/marcosluis2186
>>
>>
>
>



-- 
Lance Norskog
goksron@gmail.com
