hadoop-mapreduce-user mailing list archives

From Sonal Goyal <sonalgoy...@gmail.com>
Subject Dataset comparison and ranking - views
Date Mon, 07 Mar 2011 19:06:42 GMT
Hi,

I am working on a problem where I need to compare two datasets and rank each
record of the first against the records of the second by how similar they
are. The records have multiple fields (dimensions), though not many. Some
fields will be compared for exact matches, some for similar sound, some by
closest match, and so on. One of the datasets is large, and the other is
much smaller. The final goal is to compute a rank for each record of the
first dataset against each record of the second. The rank is based on the
weighted scores of the individual dimension comparisons.
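
To make the scoring concrete, here is a minimal sketch in plain Python (not
MapReduce) of a weighted per-field score and a ranking of one record against
the smaller dataset. The field names, weights, and the `closeness` scale are
hypothetical, and `difflib.SequenceMatcher` is only a stand-in for a real
sounds-like comparison such as a phonetic encoding.

```python
from difflib import SequenceMatcher

# Hypothetical fields and weights, chosen only for illustration.
WEIGHTS = {"city": 0.3, "name": 0.5, "age": 0.2}


def exact(a, b):
    """Exact-match comparison: 1.0 if equal, else 0.0."""
    return 1.0 if a == b else 0.0


def string_similarity(a, b):
    """Stand-in for a sounds-like comparison (not a phonetic algorithm)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def closeness(a, b, scale=10.0):
    """Closest-match comparison for numbers; `scale` is an assumed range."""
    return max(0.0, 1.0 - abs(a - b) / scale)


# Which comparison applies to which field (also an assumption).
COMPARATORS = {"city": exact, "name": string_similarity, "age": closeness}


def score(left, right):
    """Weighted sum of the per-field comparison scores, in [0, 1]."""
    return sum(
        WEIGHTS[field] * COMPARATORS[field](left[field], right[field])
        for field in WEIGHTS
    )


def rank_against(record, small_dataset):
    """Score `record` against every record of the smaller dataset,
    returning (score, other_record) pairs with the best match first."""
    scored = [(score(record, other), other) for other in small_dataset]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank_against` once per record of the large dataset, with the small
dataset held in memory, would produce the full set of pairwise ranks.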

I was wondering if people in the community have any advice, suggested
patterns, or thoughts about cross joining two datasets in MapReduce. Do let
me know if you have any suggestions.

Thanks and Regards,
Sonal
Hadoop ETL and Data Integration <https://github.com/sonalgoyal/hiho>
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>
