From: Harsh J <harsh@cloudera.com>
Date: Sun, 25 Aug 2013 20:36:58 +0530
Subject: Re: running map tasks in remote node
To: user@hadoop.apache.org

In a multi-node mode, MR requires a distributed filesystem (such as
HDFS) to be able to run.

On Sun, Aug 25, 2013 at 7:59 PM, rab ra wrote:
> Dear Yong,
>
> Thanks for your elaborate answer. Your answer really makes sense, and I am
> ending up with something close to it, except the shared storage.
>
> In my use case, I am not allowed to use any shared storage system. The
> reason is that the slave nodes may not be safe for hosting sensitive data
> (because they could belong to a different enterprise, maybe in the cloud).
> I do agree that we still need this data on the slave nodes while doing the
> processing, and hence need to transfer the data from the enterprise node
> to the processing nodes. But that is acceptable, as it is better than
> using the slave nodes for storage. If I could use shared storage, then I
> could use HDFS itself.
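
To make the point above concrete: paths handed to a job are resolved
against the configured default filesystem, so a path that exists only on
the master's local disk is invisible to the task nodes unless that
filesystem is shared by all of them. Below is a minimal sketch of that
resolution (the hostname matches the core-site.xml quoted later in this
thread; the input path is purely illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsResolution {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // fs.default.name (fs.defaultFS in newer releases) decides which
        // filesystem un-qualified paths are resolved against.
        conf.set("fs.default.name", "hdfs://A:9000"); // shared: every node sees it
        // conf.set("fs.default.name", "file:///");   // local: only this machine sees it

        FileSystem fs = FileSystem.get(conf);
        Path input = fs.makeQualified(new Path("/user/rab/inputs"));

        // Prints hdfs://A:9000/user/rab/inputs with the first setting,
        // file:/user/rab/inputs with the second.
        System.out.println(input);
      }
    }

With file:/// the qualified path only means something on the machine that
submitted the job, which is exactly the failure mode discussed below.
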
> I wrote simple example code with a 2-node cluster setup and was testing
> various input formats such as WholeFileInputFormat, NLineInputFormat and
> TextInputFormat. I faced issues when I did not want to use shared storage,
> as I explained in my last email. I was thinking that having the input file
> on the master node (job tracker) is sufficient and that it would send a
> portion of the input file to the map process on the second node (slave).
> But this was not the case, as the method setInputPath() (and the map
> reduce system) expects this path to be a shared one. All these
> observations lead to a straightforward question: "Does the map reduce
> system expect a shared storage system? Do the input directories need to be
> present in that shared system? Is there a workaround for this issue?" In
> fact, I am prepared to use HDFS just to satisfy the map reduce system and
> feed input to it. For the actual processing I shall end up transferring
> the required data files to the slave nodes.
>
> I do note that I cannot enjoy the advantages that come with HDFS, such as
> data replication, data-location-aware scheduling, etc.
>
>
> with thanks and regards
> rabmdu
>
>
> On Fri, Aug 23, 2013 at 7:41 PM, java8964 java8964 wrote:
>>
>> It is possible to do what you are trying to do, but it only makes sense
>> if your MR job is very CPU intensive and you want to use the CPU
>> resources in your cluster, rather than the IO.
>>
>> You may want to do some research about HDFS's role in Hadoop. First and
>> foremost, it provides central storage for all the files that will be
>> processed by MR jobs. If you don't want to use HDFS, you need to identify
>> a shared storage to be shared among all the nodes in your cluster. HDFS
>> is NOT required, but a shared storage is required in the cluster.
>>
>> To simplify your question, let's just use NFS to replace HDFS. It is
>> possible for a POC to help you understand how to set it up.
>>
>> Assume you have a cluster with 3 nodes (one NN, two DNs; the JT running
>> on the NN, and TTs running on the DNs). Instead of using HDFS, you can
>> try to use NFS this way:
>>
>> 1) Mount /share_data on both of your data nodes. They need to have the
>> same mount, so /share_data on each data node points to the same NFS
>> location. It doesn't matter where you host this NFS share, just make sure
>> each data node mounts it as the same /share_data.
>> 2) Create a folder under /share_data and put all your data into that
>> folder.
>> 3) When you kick off your MR jobs, you need to give a full URL for the
>> input path, like 'file:///share_data/myfolder', and also a full URL for
>> the output path, like 'file:///share_data/output'. In this way, each
>> mapper will understand that it in fact reads its data from the local file
>> system, instead of HDFS. That's the reason you want to make sure each
>> task node has the same mount path, as 'file:///share_data/myfolder'
>> should work fine on each task node. Check this and make sure that
>> /share_data/myfolder points to the same path on each of your task nodes.
>> 4) You want each mapper to process one file, so instead of using the
>> default 'TextInputFormat', use a 'WholeFileInputFormat'. This will make
>> sure that every file under '/share_data/myfolder' is not split, and that
>> each whole file is sent to a single mapper.
>> 5) In the above setup, I don't think you need to start the NameNode or
>> DataNode processes any more; you just use the JobTracker and TaskTracker.
>> 6) Obviously, when your data is big, the NFS share will be your
>> bottleneck. So maybe you can replace it with shared network storage, but
>> the above setup gives you a starting point.
>> 7) Keep in mind that with a setup like the above you lose data
>> replication, data locality, etc. That's why I said it ONLY makes sense if
>> your MR job is CPU intensive: you simply want to use the Mapper/Reducer
>> tasks to process your data, rather than for any IO scalability.
>>
>> Make sense?
>>
>> Yong
>>
>> ________________________________
>> Date: Fri, 23 Aug 2013 15:43:38 +0530
>> Subject: Re: running map tasks in remote node
>> From: rabmdu@gmail.com
>> To: user@hadoop.apache.org
>>
>> Thanks for the reply.
>>
>> I am basically exploring possible ways to work with the hadoop framework
>> for one of my use cases. I have my limitations in using HDFS, but I agree
>> that using map reduce in conjunction with HDFS makes sense.
>>
>> I successfully tested WholeFileInputFormat after some googling.
>>
>> Now, coming to my use case. I would like to keep some files on my master
>> node and do some processing on the cloud nodes. The policy does not allow
>> us to configure and use the cloud nodes as HDFS nodes. However, I would
>> like to spawn a map process on those nodes. Hence, I set the input path
>> on the local file system, for example $HOME/inputs. I have a file listing
>> filenames (10 lines) in this input directory. I use NLineInputFormat and
>> spawn 10 map processes. Each map process gets a line. The map process
>> will then do a file transfer and process it. However, I get a
>> FileNotFoundException for $HOME/inputs in the map. I am sure this
>> directory is present on my master but not on the slave nodes. When I copy
>> this input directory to the slave nodes, it works fine. I am not able to
>> figure out how to fix this, or the reason for the error. I do not
>> understand why it complains that the input directory is not present. As
>> far as I know, a slave node gets a map task and the map method receives
>> the contents of the input file; that should be enough for the map logic
>> to work.
>>
>>
>> with regards
>> rabmdu
>>
>>
>> On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 wrote:
>>
>> If you don't plan to use HDFS, what kind of shared file system are you
>> going to use across the cluster? NFS?
>> For what you want to do, even though it doesn't make too much sense, you
>> first need to solve that problem: the shared file system.
>>
>> Second, if you want to process the files file by file, instead of block
>> by block as in HDFS, then you need to use a WholeFileInputFormat (google
>> how to write one). Then you don't need a file listing all the files to be
>> processed; just put them into one folder on the shared file system and
>> pass this folder to your MR job. In this way, as long as each node can
>> access it through some file system URL, each file will be processed by
>> its own mapper.
>>
>> Yong
>>
>> ________________________________
>> Date: Wed, 21 Aug 2013 17:39:10 +0530
>> Subject: running map tasks in remote node
>> From: rabmdu@gmail.com
>> To: user@hadoop.apache.org
>>
>>
>> Hello,
>>
>> Here is the newbie question of the day.
>>
>> For one of my use cases, I want to use hadoop map reduce without HDFS.
>> Here, I will have a text file containing a list of file names to process.
>> Assume that I have 10 lines (10 files to process) in the input text file
>> and I wish to generate 10 map tasks and execute them in parallel on 10
>> nodes.
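
For reference, here is a minimal sketch of the kind of driver described
above (and of Yong's file:/// advice in steps 3 and 4 earlier in the
thread), written against the newer org.apache.hadoop.mapreduce API. The
listing file name, the output path and the transfer step inside map() are
illustrative assumptions, not code from the thread; the key constraint is
that whatever path is given to the job must resolve to the same content on
every task node.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FileListDriver {

      // Each map() call receives one line of the listing, i.e. one file
      // name. The mapper is then free to fetch that file itself (scp,
      // HTTP, ...) and process it locally; nothing here assumes a shared
      // store beyond the small listing file itself.
      public static class FetchAndProcessMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text fileName, Context context)
            throws IOException, InterruptedException {
          // ... transfer fileName from the enterprise node and process it ...
          context.write(fileName, NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "process-file-list");
        job.setJarByClass(FileListDriver.class);
        job.setMapperClass(FetchAndProcessMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1); // one listing line -> one map task

        // Fully qualified URLs, as suggested above: they must resolve to
        // the same content on every task node (NFS mount, HDFS, ...).
        FileInputFormat.addInputPath(job,
            new Path("file:///share_data/inputs/filelist.txt"));
        FileOutputFormat.setOutputPath(job,
            new Path("file:///share_data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

If the listing lives only on the submitting machine's local disk, the task
nodes cannot open it, which matches the FileNotFoundException described
above.
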
>> I started with the basic tutorial on hadoop, could set up a single-node
>> hadoop cluster, and successfully tested the wordcount code.
>>
>> Now, I took two machines, A (master) and B (slave). I did the below
>> configuration on these machines to set up a two-node cluster.
>>
>> hdfs-site.xml
>>
>> <configuration>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>1</value>
>>   </property>
>>   <property>
>>     <name>dfs.name.dir</name>
>>     <value>/tmp/hadoop-bala/dfs/name</value>
>>   </property>
>>   <property>
>>     <name>dfs.data.dir</name>
>>     <value>/tmp/hadoop-bala/dfs/data</value>
>>   </property>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>A:9001</value>
>>   </property>
>> </configuration>
>>
>> mapred-site.xml
>>
>> <configuration>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>A:9001</value>
>>   </property>
>>   <property>
>>     <name>mapreduce.tasktracker.map.tasks.maximum</name>
>>     <value>1</value>
>>   </property>
>> </configuration>
>>
>> core-site.xml
>>
>> <configuration>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>hdfs://A:9000</value>
>>   </property>
>> </configuration>
>>
>> On A and B, I have a file named 'slaves' with an entry 'B' in it, and
>> another file called 'masters' with an entry 'A' in it.
>>
>> I have kept my input file on A. I see the map method process the input
>> file line by line, but all of it is processed on A. Ideally, I would
>> expect that processing to take place on B.
>>
>> Can anyone highlight where I am going wrong?
>>
>> regards
>> rab
>>
>>

--
Harsh J
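
The thread refers several times to a WholeFileInputFormat without showing
one. A common way to write it (sketched here against the newer
org.apache.hadoop.mapreduce API; this is illustrative, not code from the
thread) is to mark each input file as non-splittable and have the record
reader hand the whole file to the mapper as a single BytesWritable record:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // One whole file per record, and therefore one whole file per mapper.
    public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false;                 // never split a file across mappers
      }

      @Override
      public RecordReader<NullWritable, BytesWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
      }

      static class WholeFileRecordReader
          extends RecordReader<NullWritable, BytesWritable> {

        private FileSplit fileSplit;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
          this.fileSplit = (FileSplit) split;
          this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
          if (processed) {
            return false;
          }
          // Read the entire file backing this split into one value.
          byte[] contents = new byte[(int) fileSplit.getLength()];
          Path file = fileSplit.getPath();
          FileSystem fs = file.getFileSystem(conf);
          FSDataInputStream in = null;
          try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
          } finally {
            IOUtils.closeStream(in);
          }
          processed = true;
          return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
      }
    }

Setting this class with job.setInputFormatClass(WholeFileInputFormat.class)
and pointing the job at a folder (as Yong suggests) gives one mapper per
file, whether the folder lives on HDFS or on a file:/// path that every
node mounts identically.
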