From: java8964 java8964 <java8964@hotmail.com>
To: user@hadoop.apache.org
Subject: RE: running map tasks in remote node
Date: Fri, 23 Aug 2013 10:11:14 -0400

It is possible to do what you are trying to do, but it only makes sense if your MR job is very CPU intensive and you want to use the CPU resources of your cluster rather than its IO.

You may want to do some research on the role HDFS plays in Hadoop. First and foremost, it provides central storage for all the files that will be processed by MR jobs. If you don't want to use HDFS, you need to identify a shared storage that all the nodes in your cluster can reach. HDFS is NOT required, but a shared storage is.

To keep your question simple, let's just use NFS in place of HDFS. That is good enough for a POC and will help you understand how to set things up.

Assume you have a cluster with 3 nodes (one NN and two DNs; the JT runs on the NN and a TT runs on each DN). Instead of using HDFS, you can use NFS like this:

1) Mount /share_data on both of your data nodes. They need to have the same mount, so /share_data on each data node points to the same NFS location. It doesn't matter where you host this NFS share; just make sure every data node mounts it at the same /share_data.
2) Create a folder under /share_data and put all your data into that folder.
3) When you kick off your MR job, give a full URL for the input path, like 'file:///share_data/myfolder', and a full URL for the output path, like 'file:///share_data/output'. This way each mapper knows it will read the data from the local file system rather than from HDFS. That is why each task node must have the same mount path: 'file:///share_data/myfolder' has to resolve on every task node. Check that /share_data/myfolder points to the same data on each of your task nodes. (A driver sketch follows this list.)
4) You want each mapper to process one file, so instead of the default 'TextInputFormat', use a 'WholeFileInputFormat'. This ensures that each file under '/share_data/myfolder' is not split and goes entirely to a single mapper.
5) With the setup above, I don't think you need to start the NameNode or DataNode processes any more; you only use the JobTracker and the TaskTrackers.
6) Obviously, when your data gets big, the NFS share will be your bottleneck. You could replace it with shared network storage later, but the setup above gives you a starting point.
7) Keep in mind that with this setup you lose data replication, data locality, etc. That is why I said it ONLY makes sense if your MR job is CPU intensive: you simply want to use the mapper/reducer tasks to process your data, not the IO scalability.
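
To illustrate steps 3 and 4, a minimal driver sketch along these lines should work. It assumes the old mapred API (Hadoop 1.x with JobTracker/TaskTracker); IdentityMapper is only a placeholder for your real processing mapper, and WholeFileInputFormat is the class sketched further down the thread.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class NfsJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(NfsJobDriver.class);
        conf.setJobName("process-files-from-nfs");

        // Point both input and output at the shared NFS mount using file:// URLs,
        // so every TaskTracker resolves the same local path instead of HDFS.
        FileInputFormat.setInputPaths(conf, new Path("file:///share_data/myfolder"));
        FileOutputFormat.setOutputPath(conf, new Path("file:///share_data/output"));

        // One whole, unsplit file per mapper (see the WholeFileInputFormat sketch below).
        conf.setInputFormat(WholeFileInputFormat.class);

        // Placeholder: swap IdentityMapper for the mapper that does your real processing.
        conf.setMapperClass(IdentityMapper.class);
        conf.setNumReduceTasks(0);                      // map-only job in this sketch
        conf.setOutputKeyClass(NullWritable.class);     // matches WholeFileInputFormat's key type
        conf.setOutputValueClass(BytesWritable.class);  // matches WholeFileInputFormat's value type

        JobClient.runJob(conf);
    }
}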

Make sense?

Yong


Date: Fri, 23 Aug 2013 15:43:38 +0530
Subject: Re: running map tasks in remote node
From: rabmdu@gmail.com
To: user@hadoop.apache.org

Thanks for the reply.

I am basically exploring possible ways to work with the Hadoop framework for one of my use cases. I have my limitations in using HDFS, but I agree that using MapReduce in conjunction with HDFS makes sense.

I successfully tested a WholeFileInputFormat after some googling.

Now, coming to my use case. I would like to keep some files on my master node and do some processing on the cloud nodes. Policy does not allow us to configure and use the cloud nodes as HDFS. However, I would like to spawn map processes on those nodes. Hence, I set the input path to the local file system, for example, $HOME/inputs. I have a file listing filenames (10 lines) in this input directory. I use NLineInputFormat and spawn 10 map processes; each map process gets one line, transfers the file it names, and processes it. However, in the map I get a FileNotFoundException for $HOME/inputs. I am sure this directory is present on my master but not on the slave nodes. When I copy this input directory to the slave nodes, it works fine. I cannot figure out how to fix this or the reason for the error. I do not understand why it complains that the input directory is not present. As far as I know, a slave node just receives a map task, and the map method receives the contents of the input file; that should be enough for the map logic to work.


with regards
rabmdu
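
A minimal sketch of the NLineInputFormat setup described above, again assuming the old mapred API; the paths are hypothetical stand-ins, and IdentityMapper is a placeholder for the mapper that would fetch and process each listed file.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class FileListDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(FileListDriver.class);
        conf.setJobName("one-map-per-listed-file");

        // Each split carries one line of the listing file,
        // so a 10-line listing yields 10 map tasks.
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", 1);

        // A local (file://) input path must exist on every task node,
        // which is exactly the FileNotFoundException described above.
        // These paths are hypothetical examples.
        FileInputFormat.setInputPaths(conf, new Path("file:///home/hadoop/inputs/filelist.txt"));
        FileOutputFormat.setOutputPath(conf, new Path("file:///home/hadoop/outputs"));

        // Placeholder: the real mapper would fetch and process the file named on each line.
        conf.setMapperClass(IdentityMapper.class);
        conf.setNumReduceTasks(0);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        JobClient.runJob(conf);
    }
}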

On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 <java8964@hotmail.com> wrote:

If you don't plan to use HDFS, what kind of shared file system are you going to use across the cluster? NFS?
For what you want to do, even though it doesn't make too much sense, you first need to solve that problem: the shared file system.

Second, if you want to process the files file by file, instead of block by block as in HDFS, you need to use a WholeFileInputFormat (google how to write one). Then you don't need a file listing all the files to be processed; just put them into one folder on the shared file system and send that folder to your MR job. As long as each node can access it through some file system URL, each file will be processed in its own mapper.
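
One common way to write such a WholeFileInputFormat against the old mapred API is sketched below; this is only an illustrative sketch, not code from the original thread. It reads each file as a single record, with the whole file contents in a BytesWritable value.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// One whole file per record: the key is unused, the value is the full file contents.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;   // never split, so one mapper sees the entire file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }

    static class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
        private final FileSplit split;
        private final JobConf job;
        private boolean processed = false;

        WholeFileRecordReader(FileSplit split, JobConf job) {
            this.split = split;
            this.job = job;
        }

        @Override
        public boolean next(NullWritable key, BytesWritable value) throws IOException {
            if (processed) {
                return false;
            }
            // Read the entire file into the value in one shot.
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(job);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable createKey() { return NullWritable.get(); }

        @Override
        public BytesWritable createValue() { return new BytesWritable(); }

        @Override
        public long getPos() throws IOException { return processed ? split.getLength() : 0; }

        @Override
        public void close() throws IOException { }

        @Override
        public float getProgress() throws IOException { return processed ? 1.0f : 0.0f; }
    }
}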

Yong


Date: Wed, 21 Aug 2013 17:39:10 +0530
Subject: running map tasks in remote node
From: rabmdu@gmail.com
To: user@hadoop.apache.org


Hello,

Here is the newbie question of the day.

For one of my use cases, I want to use Hadoop MapReduce without HDFS. Here, I will have a text file containing a list of file names to process. Assume that I have 10 lines (10 files to process) in the input text file, and I wish to generate 10 map tasks and execute them in parallel on 10 nodes. I started with the basic Hadoop tutorial, was able to set up a single-node Hadoop cluster, and successfully tested the wordcount code.

Now, I took two machines, A (master) and B (slave), and did the configuration below on these machines to set up a two-node cluster.

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-bala/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-bala/dfs/data</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>A:9001</value>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>A:9001</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
</configuration>

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://A:9000</value>
  </property>
</configuration>

On both A and B, I have a file named 'slaves' with an entry 'B' in it, and another file called 'masters' in which there is an entry 'A'.

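For clarity, those two files would contain just the host names listed above; the conf/ location shown here is an assumption based on a standard Hadoop 1.x layout, not something stated in the original message.

conf/masters:
A

conf/slaves:
B
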
I have kept my input file on A. I see the map method process the input file line by line, but it is all processed on A. Ideally, I would expect that processing to take place on B.

Can anyone highlight where I am going wrong?

regards
rab
