From: Harsh J
Date: Tue, 21 Aug 2012 15:09:44 +0530
Subject: Re: Extension points available for data locality
To: user@hadoop.apache.org

Tharindu,

(I'm assuming you've done enough research to know that there's benefit
in what you're attempting to do.)

Task locality is determined by the job's InputFormat class.
Specifically, the locality information carried by the InputSplit objects
returned from the InputFormat#getSplits(…) API is what the MR scheduler
looks at when trying to launch data-local tasks.

You can tweak your InputFormat (the one that uses this DB as input?) to
return relevant locations based on your "DB Cluster", in order to
achieve this.

On Tue, Aug 21, 2012 at 2:36 PM, Tharindu Mathew wrote:
> Hi,
>
> I'm doing some research that involves pulling data stored in a mysql cluster
> directly for a map reduce job, without storing the data in HDFS.
>
> I'd like to run hadoop task tracker nodes directly on the mysql cluster
> nodes. The purpose of this being, starting mappers directly in the node
> closest to the data if possible (data locality).
>
> I notice that with HDFS, since the name node knows exactly where each data
> block is, it uses this to achieve data locality.
>
> Is there a way to achieve my requirement possibly by extending the name node
> or otherwise?
>
> Thanks in advance.
>
> --
> Regards,
>
> Tharindu
>
> blog: http://mackiemathew.com/
>

--
Harsh J
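
To make the InputFormat suggestion in the reply above concrete, here is a rough, Hadoop-free sketch of the split-planning step. Everything in it is an assumption for illustration: the class DbLocalitySketch, the ShardSplit type, the partition names, and the db-node-N hostnames are all hypothetical. ShardSplit only mirrors the one part of Hadoop's InputSplit contract that matters here, getLocations(), which reports the hostnames holding a split's data. In a real job, logic like planSplits would sit inside the custom InputFormat's getSplits(…), and the partition layout would have to come from the DB cluster itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Plain-Java model of locality-aware split planning for a sharded database.
 * Nothing here depends on Hadoop; all names are hypothetical.
 */
final class DbLocalitySketch {

    /** Hypothetical split: one DB partition plus the hosts that store it. */
    static final class ShardSplit {
        private final String partition;
        private final String[] hosts;

        ShardSplit(String partition, List<String> hosts) {
            this.partition = partition;
            this.hosts = hosts.toArray(new String[0]);
        }

        String getPartition() {
            return partition;
        }

        /** Plays the role of InputSplit#getLocations(): hostnames holding the data. */
        String[] getLocations() {
            return hosts.clone(); // defensive copy so callers cannot alter the split
        }
    }

    /** Builds one split per partition, preserving the map's iteration order. */
    static List<ShardSplit> planSplits(Map<String, List<String>> partitionToHosts) {
        List<ShardSplit> splits = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : partitionToHosts.entrySet()) {
            splits.add(new ShardSplit(entry.getKey(), entry.getValue()));
        }
        return splits;
    }

    /** Made-up partition layout; a real job would query the DB cluster for this. */
    static Map<String, List<String>> exampleLayout() {
        Map<String, List<String>> layout = new LinkedHashMap<>();
        layout.put("orders_p0", Arrays.asList("db-node-1", "db-node-2"));
        layout.put("orders_p1", Arrays.asList("db-node-2", "db-node-3"));
        return layout;
    }
}
```

Calling DbLocalitySketch.planSplits(DbLocalitySketch.exampleLayout()) yields two splits, each reporting the two hosts assumed to hold its partition. For the scheduler to act on such hints, the reported hostnames would presumably need to match the names under which the task tracker nodes are known to the cluster.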