From: Ling Kun
Date: Fri, 22 Feb 2013 15:56:33 +0800
Subject: Re: How to add another file system in Hadoop
To: user@hadoop.apache.org

Dear Nikhil and all,

    Your question is a bit complex to answer, and since I am not an expert on Hadoop, the following answer may contain some errors; any corrections are welcome.

1. Your MR command is issued by the client, which submits a job to the JobTracker of the Hadoop cluster.

2. The JobTracker will split the input file (usually according to the block size of the underlying DFS) and will then create a number of map tasks and reduce tasks; usually each map task consumes one block and writes out some intermediate data.

3. The JobTracker will schedule these tasks to different TaskTrackers according to the location in the DFS of the block each map task will consume. If, unfortunately, a map task cannot be assigned to a TaskTracker that stores its block, the data of the block will be transferred to the node where the task will run (this is done inside the underlying DFS object, and it is where *getFileBlockLocations* takes effect; the MR framework does not even notice it). A small sketch of that call is below.

4. So, as you can see, your client will not collect all the remote data locally. It only submits a job that tells the JobTracker how to split the input file, how to do the map, how to combine the intermediate data, how to do the reduce, where the input file is in the DFS, and where to write the output in the DFS.
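To make point 3 a bit more concrete, here is a minimal sketch (not part of the original mail) of how block locations can be queried through the FileSystem API; the input path is just a made-up example:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);            // the configured default file system
        Path input = new Path("/user/nikhil/input.txt"); // hypothetical input file

        FileStatus status = fs.getFileStatus(input);
        // One BlockLocation per block of the file, each listing the hosts that
        // store a replica; the JobTracker uses this information to place map
        // tasks close to the data.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println("offset=" + b.getOffset()
              + " length=" + b.getLength()
              + " hosts=" + Arrays.toString(b.getHosts()));
        }
      }
    }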
Maybe you should also search for some blog posts, or refer to "Hadoop: The Definitive Guide" written by Tom White, for a more authoritative answer.

yours,
Ling Kun

On Fri, Feb 22, 2013 at 1:05 PM, Agarwal, Nikhil <Nikhil.Agarwal@netapp.com> wrote:

> Hi All,
>
> Thanks a lot for taking out your time to answer my question.
>
> Ling, thank you for directing me to glusterfs. I can surely take a lot of
> help from that, but what I wanted to know is that in README.txt it is
> mentioned:
>
> >> # ./bin/start-mapred.sh
>
> If the map/reduce job/task trackers are up, all I/O will be done to
> GlusterFS.
>
> So, suppose my input files are scattered across different nodes (glusterfs
> servers), how do I (a hadoop client with glusterfs plugged in) issue a
> MapReduce command?
>
> Moreover, after issuing a MapReduce command, would my hadoop client fetch
> all the data from the different servers to my local machine and then do the
> MapReduce, or would it start the TaskTracker daemons on the machine(s) where
> the input file(s) are located and perform the MapReduce there?
>
> Please rectify me if I am wrong, but I suppose that the location of the input
> files to MapReduce is being returned by the function
> *getFileBlockLocations*(FileStatus file, long start, long len).
>
> Thank you very much for your time and helping me out.
>
> Regards,
> Nikhil
>
> *From:* Agarwal, Nikhil
> *Sent:* Thursday, February 21, 2013 4:19 PM
> *To:* 'user@hadoop.apache.org'
> *Subject:* How to add another file system in Hadoop
>
> Hi,
>
> I am planning to add a file system called CDMI under org.apache.hadoop.fs
> in Hadoop, something similar to KFS or S3 which are already there under
> org.apache.hadoop.fs. I wanted to ask: say I write my file system for CDMI
> and add the package under fs, but then how do I tell core-site.xml or other
> configuration files to use the CDMI file system? Where all do I need to make
> changes to enable the CDMI file system to become a part of Hadoop?
>
> Thanks a lot in advance.
>
> Regards,
> Nikhil

--
http://www.lingcc.com
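On the configuration question quoted above, here is a minimal sketch (again, not from the original thread) of the usual Hadoop mechanism: the implementation class is registered under a fs.<scheme>.impl property. The "cdmi" scheme, class name, and URIs below are assumptions made up for illustration; there is no such implementation in Hadoop:

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CdmiWiringDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Equivalent to adding this to core-site.xml:
        //   <property>
        //     <name>fs.cdmi.impl</name>
        //     <value>org.apache.hadoop.fs.cdmi.CDMIFileSystem</value>  (hypothetical class)
        //   </property>
        // where CDMIFileSystem extends org.apache.hadoop.fs.FileSystem.
        conf.set("fs.cdmi.impl", "org.apache.hadoop.fs.cdmi.CDMIFileSystem");

        // Optionally make it the default file system for all job I/O
        // (Hadoop 1.x property name):
        // conf.set("fs.default.name", "cdmi://cdmi-host:9000/");

        // Any cdmi:// path is now routed to the registered implementation.
        FileSystem fs = FileSystem.get(new Path("cdmi://cdmi-host:9000/").toUri(), conf);
        InputStream in = fs.open(new Path("cdmi://cdmi-host:9000/some/file")); // hypothetical path
        in.close();
      }
    }

Once that property (and the jar containing the class) is available on every node, normally via core-site.xml and the Hadoop classpath, MapReduce jobs should be able to read and write cdmi:// paths like any other Hadoop file system.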