Date: Thu, 26 May 2011 10:21:51 -0300
Subject: Re: Comparing
From: "Juan P."
To: common-user@hadoop.apache.org

Harsh,

Thanks for your response, it was very helpful. There are still a couple of
things which are not really clear to me though.

You say that "Keys have got to be compared by the MR framework". But I'm
still not 100% sure why keys are sorted. I thought what Hadoop did was,
during shuffling, choose which keys went to which reducer, and then for each
key/value check the key and send it to the correct node. If that were the
case, then a good equals implementation could be enough. So why, instead of
just *shuffling*, does the MR framework *sort* the items?

Also, you were very clear about the use of RawComparator, thank you. Do you
know how RawComparable works though?

Again, thanks for your help!

Cheers,
Pony

On Thu, May 26, 2011 at 1:58 AM, Harsh J wrote:
> Pony,
>
> Keys have got to be compared by the MR framework somehow, and the way
> it does when you use Writables is by ensuring that your Key is of a
> Writable + Comparable type (WritableComparable).
>
> If you specify a specific comparator class, then that will be used;
> else the default WritableComparator will get asked if it can supply a
> comparator for use with your key type.
>
> AFAIK, the default WritableComparator wraps around RawComparator and
> does indeed deserialize the writables before applying the compare
> operation. The RawComparator's primary idea is to give you a pair of
> raw byte sequences to compare directly. Certain other serialization
> libraries (Apache Avro is one) provide ways to compare using bytes
> itself (Across different types), which can end up being faster when
> used in jobs.
>
> Hope this clears up your confusion.
>
> On Tue, May 24, 2011 at 2:06 AM, Juan P. wrote:
> > Hi guys,
> > I wanted to get your help with a couple of questions which came up while
> > looking at the Hadoop Comparator/Comparable architecture.
> >
> > As I see it before each reducer operates on each key, a sorting algorithm is
> > applied to them. *Why does Hadoop need to do that?*
> >
> > If I implement my own class and I intend to use it as a Key I must allow for
> > instances of my class to be compared. So I have 2 choices: I can implement
> > WritableComparable or I can register a WritableComparator for my
> > class. Should I fail to do either, would the Job fail?
> > If I register my WritableComparator which does not use the Comparable
> > interface at all, does my Key need to implement WritableComparable?
> > If I don't implement my Comparator and my Key implements WritableComparable,
> > does it mean that Hadoop will deserialize my Keys twice? (once for sorting,
> > and once for reducing)
> > What is RawComparable used for?
> >
> > Thanks for your help!
> > Pony
>
> --
> Harsh J
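
For reference, here is a minimal sketch of the two comparison strategies
described in the quoted reply: deserializing both keys before comparing them,
versus comparing the serialized bytes in place. It is plain Java and does not
use any Hadoop classes. The class and method names (RawCompareSketch,
serialize, compareByDeserializing, compareRaw) are hypothetical, and the key
is assumed to be a single int written as 4 big-endian bytes. Only the
parameter shape (b1, s1, l1, b2, s2, l2) is borrowed from Hadoop's
RawComparator.compare; this is an illustration, not how WritableComparator
is actually implemented.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-alone sketch; none of these names come from Hadoop.
class RawCompareSketch {

    // Assumption: a key is one int, serialized as 4 big-endian bytes
    // (ByteBuffer's default byte order).
    static byte[] serialize(int key) {
        return ByteBuffer.allocate(4).putInt(key).array();
    }

    // Strategy 1: rebuild both keys from their bytes, then compare the values.
    static int compareByDeserializing(byte[] b1, int s1, int l1,
                                      byte[] b2, int s2, int l2) {
        int k1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int k2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        return Integer.compare(k1, k2);
    }

    // Strategy 2: compare the serialized bytes directly, with no key objects.
    // For a big-endian two's-complement int, the first byte carries the sign
    // and is compared as signed; the remaining bytes are compared as unsigned.
    static int compareRaw(byte[] b1, int s1, int l1,
                          byte[] b2, int s2, int l2) {
        int c = Byte.compare(b1[s1], b2[s2]);
        for (int i = 1; c == 0 && i < 4; i++) {
            c = Integer.compare(b1[s1 + i] & 0xff, b2[s2 + i] & 0xff);
        }
        return c;
    }

    public static void main(String[] args) {
        byte[] a = serialize(-5);
        byte[] b = serialize(3);
        // Both strategies should agree on the sign of the result.
        System.out.println(Integer.signum(compareByDeserializing(a, 0, 4, b, 0, 4)));
        System.out.println(Integer.signum(compareRaw(a, 0, 4, b, 0, 4)));
    }
}
```

Both methods give the same ordering for this key layout; the raw version
simply avoids reconstructing the keys for every comparison made during the
sort.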