hadoop-mapreduce-user mailing list archives

From Thamizhannal Paramasivam <thamizhanna...@gmail.com>
Subject Re: hadoop
Date Fri, 06 Jan 2012 06:00:18 GMT
Hi,

For (a) and (d), refer to http://wiki.apache.org/hadoop/HowManyMapsAndReduces
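
As a rough illustration of (a) and (d): the number of map tasks normally
equals the number of input splits, and for file-based input the split size is
derived from the HDFS block size. The Python sketch below is a minimal
approximation of the split arithmetic used by the old-API FileInputFormat
(split size = max(min size, min(goal size, block size)), with a 10% slop on
the last split), together with the common rule of thumb of 0.95 or 1.75 times
the total number of reduce slots. The function names, the 64 MB block size,
and the 2 reduce slots per node are assumptions for illustration only; check
them against your own configuration.

```python
SPLIT_SLOP = 1.1  # the final split may be up to 10% larger than split_size
MB = 1024 * 1024


def estimate_map_tasks(file_size, block_size=64 * MB, requested_maps=1,
                       min_split_size=1):
    """Estimate the number of input splits (one map task each) for one file.

    requested_maps stands in for the mapred.map.tasks hint; it only matters
    when it makes the goal size smaller than the block size.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    goal_size = file_size // max(requested_maps, 1)
    split_size = max(min_split_size, min(goal_size, block_size))

    splits = 0
    remaining = file_size
    # Emit full-size splits until what is left fits within the slop factor.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining != 0:
        splits += 1  # the tail becomes the last split
    return splits


def suggest_reduce_tasks(nodes, reduce_slots_per_node=2, factor=0.95):
    """Rule-of-thumb reducer count: factor * total reduce slots (at least 1)."""
    return max(1, int(factor * nodes * reduce_slots_per_node))


if __name__ == "__main__":
    print(estimate_map_tasks(1024 * MB))                     # 1 GB, 64 MB blocks
    print(estimate_map_tasks(1024 * MB, requested_maps=32))  # ask for more maps
    print(suggest_reduce_tasks(nodes=1))
```

Each map or reduce task runs in a child JVM started by the tasktracker, so the
extra JVMs appear only while a job is running, and only as many at a time as
the tasktracker has free slots.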

For (b), package your job as a .jar and invoke the hadoop command as below.
Hadoop ships the job jar to the nodes that run your tasks. Dependent jars
placed in a lib/ directory inside the job jar are added to the task classpath.
E.g. $ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount \
       /usr/joe/wordcount/input /usr/joe/wordcount/output

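For (c), as soon as you put some files into HDFS, their blocks are copied to
the data nodes according to the replication factor; the name node only records
where each block lives. You need not worry about the physical location of the
data. You can verify your input files with the hadoop fs -ls and
hadoop fs -cat commands.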
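
To make the block placement concrete, here is a minimal sketch, assuming the
default 64 MB block size and a replication factor of 3: a file is cut into
fixed-size blocks, and each block is stored on at most one data node per
replica, so a single-node setup holds only one copy of each block. The
function name and parameters are hypothetical. To see the actual block
locations, "hadoop fsck <path> -files -blocks -locations" reports them.

```python
import math

MB = 1024 * 1024


def block_layout(file_size, block_size=64 * MB, replication=3, datanodes=1):
    """Return (number of blocks, total stored block copies) for one file.

    HDFS keeps at most one replica of a block per data node, so the number
    of copies per block is capped by the number of data nodes.
    """
    blocks = math.ceil(file_size / block_size)
    copies_per_block = min(replication, datanodes)
    return blocks, blocks * copies_per_block


if __name__ == "__main__":
    print(block_layout(200 * MB))               # single node
    print(block_layout(200 * MB, datanodes=5))  # five data nodes
```

With file-based input, each block typically becomes one input split, which is
how the data put into HDFS turns into map tasks.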

For (e)
http://wiki.apache.org/hadoop/
http://www.cloudera.com/resources/

thanks,
Thamizh

On Thu, Jan 5, 2012 at 11:07 PM, Satish Setty (HCL Financial Services) <
Satish.Setty@hcl.com> wrote:

> Hello,
>
> We are trying to use Hadoop-0.20.203.0rc1 for parallel computation. Our
> queries are below.
>
> Assume a single high-end machine with 8 cores and 8 GB of memory.
>
> (a) How do we know the number of map tasks spawned? Can this be controlled?
> We notice only 4 JVMs running on a single node: namenode, datanode,
> jobtracker, and tasktracker. As we understand it, one map task is spawned
> per split, so we should see that many additional JVMs.
>
> (b) Our mapper class should perform complex computations. It has plenty
> of dependent jars, so how do we add all the jars to the classpath while
> running the application? Since we need to perform parallel computations, we
> need many map tasks running in parallel on different data, all on the same
> machine in different JVMs.
>
> (c) How does the data split happen? JobClient does not talk about data
> splits. As we understand it, we format the distributed file system, run
> start-all.sh, and then run "hadoop fs -put". Does this write data to all
> datanodes? We are unable to see the physical location. How does the split
> happen from this HDFS source?
>
> (d) Can we control the number of reduce tasks? Does each run in a separate
> JVM? How are the optimal numbers of map and reduce tasks determined?
>
> (e) Is there any good documentation or links about the namenode, datanode,
> jobtracker, and tasktracker?
>
> Kindly help.
>
> Thanks
