hadoop-user mailing list archives

From João Paulo Forny <jpfo...@gmail.com>
Subject Reduce side join of similar records
Date Fri, 28 Feb 2014 17:37:01 GMT
I'm implementing a reduce-side join between two datasets, A and B, on a
String key (the name attribute). The join has to match similar names, not
only identical ones.

Since I was already using a secondary sort so that the values extracted from
database A arrive before the values from database B, my first thought was to
write a grouping comparator class that groups keys with a string similarity
algorithm instead of comparing the natural key with compareTo. It has not
worked as expected: names that match according to my algorithm are not
grouped under the same key. See my code below.

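import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class StringSimilarityGroupingComparator extends WritableComparator {

    protected StringSimilarityGroupingComparator() {
        // true: let WritableComparator create key instances for deserialization
        super(JoinKeyTagPairWritable.class, true);
    }

    // Returns 0 when the two join keys are considered similar; otherwise
    // falls back to the natural ordering of the join key.
    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable w1, WritableComparable w2) {
        JoinKeyTagPairWritable k1 = (JoinKeyTagPairWritable) w1;
        JoinKeyTagPairWritable k2 = (JoinKeyTagPairWritable) w2;
        StringSimilarityMatcher nameMatcher = new StringSimilarityMatcher(
                StringSimilarityMatcher.NAME_MATCH);

        return nameMatcher.match(k1.getJoinKey(), k2.getJoinKey())
                ? 0
                : k1.getJoinKey().compareTo(k2.getJoinKey());
    }
}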

This approach makes total sense to me. Where was I mistaken? Isn't this the
purpose of overriding the grouping comparator class?
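
For reference, the symptom can be reproduced outside Hadoop. The sketch below is plain Java with no Hadoop classes. It assumes that the reduce side decides where a group ends by comparing each key only with the key that follows it in sort order, and it uses equalsIgnoreCase as a hypothetical stand-in for StringSimilarityMatcher. The class and method names are made up for illustration. It does not model the partitioner, which in a real job may also send similar names to different reducers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class AdjacentGroupingSketch {

    // Hypothetical stand-in for StringSimilarityMatcher.match(...):
    // two names are "similar" when they are equal ignoring case.
    static boolean similar(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    // Same shape as the grouping comparator in the post:
    // 0 for similar keys, natural ordering otherwise.
    static final Comparator<String> GROUPING =
            (a, b) -> similar(a, b) ? 0 : a.compareTo(b);

    // Splits an already sorted list of keys into groups. A new group starts
    // whenever the grouping comparator returns non-zero for two neighbours.
    static List<List<String>> group(List<String> sortedKeys,
                                    Comparator<String> grouping) {
        List<List<String>> groups = new ArrayList<>();
        String previous = null;
        for (String key : sortedKeys) {
            if (previous == null || grouping.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(Arrays.asList("ana", "Bob", "ANA"));
        // Natural (sort comparator) order puts "Bob" between "ANA" and "ana".
        Collections.sort(keys);
        // prints [[ANA], [Bob], [ana]]
        System.out.println(group(keys, GROUPING));
    }
}
```

Under these assumptions "ANA" and "ana" match, yet they land in separate groups because the sort order places "Bob" between them. Similar keys are only merged when the sort order happens to make them neighbours.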
