From: João Paulo Forny
Date: Fri, 28 Feb 2014 14:37:01 -0300
To: user@hadoop.apache.org
Subject: Reduce side join of similar records

I'm implementing a join between two datasets A and B by a String key, which is the name attribute. I need to match similar names in this join.

My first thought, since I was already using a secondary sort to receive the values extracted from database A before the values from database B, was to write a grouping comparator class that groups values with a string similarity algorithm instead of comparing the natural key with compareTo. It has not worked as expected: names that match under my algorithm are not grouped under the same key. See my code below.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class StringSimilarityGroupingComparator extends WritableComparator {

    protected StringSimilarityGroupingComparator() {
        // true: let WritableComparator create key instances for deserialization
        super(JoinKeyTagPairWritable.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        JoinKeyTagPairWritable k1 = (JoinKeyTagPairWritable) w1;
        JoinKeyTagPairWritable k2 = (JoinKeyTagPairWritable) w2;
        StringSimilarityMatcher nameMatcher = new StringSimilarityMatcher(
                StringSimilarityMatcher.NAME_MATCH);

        // Treat similar names as equal; otherwise fall back to natural key order.
        return nameMatcher.match(k1.getJoinKey(), k2.getJoinKey())
                ? 0
                : k1.getJoinKey().compareTo(k2.getJoinKey());
    }
}

This approach makes total sense to me. Where was I mistaken? Isn't this the purpose of overriding the grouping comparator class?
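
A minimal sketch of one thing that may be going on here, assuming the reduce side applies the grouping comparator only to consecutive keys of the stream that the sort comparator has already ordered. No Hadoop classes are involved: `GroupingSketch`, `group` and `similar` are illustrative names, and `similar` is a hypothetical stand-in for StringSimilarityMatcher that treats two names as a match when their edit distance is at most 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class GroupingSketch {

    // Hypothetical stand-in for StringSimilarityMatcher.match():
    // two names match when their Levenshtein distance is at most 1.
    static boolean similar(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()] <= 1;
    }

    // Same shape as the grouping comparator in the question.
    static final Comparator<String> GROUPING =
            (a, b) -> similar(a, b) ? 0 : a.compareTo(b);

    // Sort by natural key order, then start a new group whenever the
    // grouping comparator says two consecutive keys differ.
    static List<List<String>> group(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.naturalOrder());
        List<List<String>> groups = new ArrayList<>();
        for (String key : sorted) {
            List<String> last = groups.isEmpty() ? null : groups.get(groups.size() - 1);
            if (last != null && GROUPING.compare(last.get(last.size() - 1), key) == 0) {
                last.add(key);
            } else {
                List<String> fresh = new ArrayList<>();
                fresh.add(key);
                groups.add(fresh);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // "Anna Lee" and "Hanna Lee" match, but "Bob Ray" sorts between them.
        System.out.println(group(Arrays.asList("Hanna Lee", "Bob Ray", "Anna Lee")));
    }
}
```

With these inputs, "Anna Lee" and "Hanna Lee" match each other but end up in separate groups, because "Bob Ray" sorts between them and only neighbours are compared. In a real job the partitioner would also have to send both names to the same reducer before any grouping could bring them together.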
