From: Harsh J <harsh@cloudera.com>
Date: Sun, 25 Aug 2013 20:36:58 +0530
Subject: Re: running map tasks in remote node
To: user@hadoop.apache.org

In a multi-node mode, MR requires a distributed filesystem (such as
HDFS) to be able to run.

On Sun, Aug 25, 2013 at 7:59 PM, rab ra wrote:
> Dear Yong,
>
> Thanks for your elaborate answer. Your answer really makes sense, and I am
> ending up with something close to it, except the shared storage.
>
> In my use case, I am not allowed to use any shared storage system. The
> reason is that the slave nodes may not be safe for hosting sensitive data
> (because they could belong to a different enterprise, maybe in the cloud).
> I do agree that we still need this data on the slave nodes while doing the
> processing, and hence need to transfer the data from the enterprise node
> to the processing nodes. But that is acceptable, as it is better than
> using the slave nodes for storage. If I could use shared storage, then I
> could use HDFS itself.
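
To make the point above concrete: paths handed to a job are resolved
against the configured default filesystem, so a path that exists only on
the master's local disk is invisible to the task nodes unless that
filesystem is shared by all of them. Below is a minimal sketch of that
resolution (the hostname matches the core-site.xml quoted later in this
thread; the input path is purely illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsResolution {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // fs.default.name (fs.defaultFS in newer releases) decides which
        // filesystem un-qualified paths are resolved against.
        conf.set("fs.default.name", "hdfs://A:9000"); // shared: every node sees it
        // conf.set("fs.default.name", "file:///");   // local: only this machine sees it

        FileSystem fs = FileSystem.get(conf);
        Path input = fs.makeQualified(new Path("/user/rab/inputs"));

        // Prints hdfs://A:9000/user/rab/inputs with the first setting,
        // file:/user/rab/inputs with the second.
        System.out.println(input);
      }
    }

With file:/// the qualified path only means something on the machine that
submitted the job, which is exactly the failure mode discussed below.
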
> I wrote simple example code with a 2-node cluster setup and was testing
> various input formats such as WholeFileInputFormat, NLineInputFormat and
> TextInputFormat. I faced issues when I did not want to use shared storage,
> as I explained in my last email. I was thinking that having the input file
> on the master node (job tracker) is sufficient and that it would send a
> portion of the input file to the map process on the second node (slave).
> But this was not the case, as the method setInputPath() (and the map
> reduce system) expects this path to be a shared one. All these
> observations lead to a straightforward question: "Does the map reduce
> system expect a shared storage system? Do the input directories need to be
> present in that shared system? Is there a workaround for this issue?" In
> fact, I am prepared to use HDFS just to satisfy the map reduce system and
> feed input to it. For the actual processing I shall end up transferring
> the required data files to the slave nodes.
>
> I do note that I cannot enjoy the advantages that come with HDFS, such as
> data replication, data-location-aware scheduling, etc.
>
>
> with thanks and regards
> rabmdu
>
>
> On Fri, Aug 23, 2013 at 7:41 PM, java8964 java8964 wrote:
>>
>> It is possible to do what you are trying to do, but it only makes sense
>> if your MR job is very CPU intensive and you want to use the CPU
>> resources in your cluster, rather than the IO.
>>
>> You may want to do some research about HDFS's role in Hadoop. First and
>> foremost, it provides central storage for all the files that will be
>> processed by MR jobs. If you don't want to use HDFS, you need to identify
>> a shared storage to be shared among all the nodes in your cluster. HDFS
>> is NOT required, but a shared storage is required in the cluster.
>>
>> To simplify your question, let's just use NFS to replace HDFS. It is
>> possible for a POC to help you understand how to set it up.
>>
>> Assume you have a cluster with 3 nodes (one NN, two DNs; the JT running
>> on the NN, and TTs running on the DNs). Instead of using HDFS, you can
>> try to use NFS this way:
>>
>> 1) Mount /share_data on both of your data nodes. They need to have the
>> same mount, so /share_data on each data node points to the same NFS
>> location. It doesn't matter where you host this NFS share, just make sure
>> each data node mounts it as the same /share_data.
>> 2) Create a folder under /share_data and put all your data into that
>> folder.
>> 3) When you kick off your MR jobs, you need to give a full URL for the
>> input path, like 'file:///share_data/myfolder', and also a full URL for
>> the output path, like 'file:///share_data/output'. In this way, each
>> mapper will understand that it in fact reads its data from the local file
>> system, instead of HDFS. That's the reason you want to make sure each
>> task node has the same mount path, as 'file:///share_data/myfolder'
>> should work fine on each task node. Check this and make sure that
>> /share_data/myfolder points to the same path on each of your task nodes.
>> 4) You want each mapper to process one file, so instead of using the
>> default 'TextInputFormat', use a 'WholeFileInputFormat'. This will make
>> sure that every file under '/share_data/myfolder' is not split, and that
>> each whole file is sent to a single mapper.
>> 5) In the above setup, I don't think you need to start the NameNode or
>> DataNode processes any more; you just use the JobTracker and TaskTracker.
>> 6) Obviously, when your data is big, the NFS share will be your
>> bottleneck. So maybe you can replace it with shared network storage, but
>> the above setup gives you a starting point.
>> 7) Keep in mind that with a setup like the above you lose data
>> replication, data locality, etc. That's why I said it ONLY makes sense if
>> your MR job is CPU intensive: you simply want to use the Mapper/Reducer
>> tasks to process your data, rather than for any IO scalability.
>>
>> Make sense?
>>
>> Yong
>>
>> ________________________________
>> Date: Fri, 23 Aug 2013 15:43:38 +0530
>> Subject: Re: running map tasks in remote node
>> From: rabmdu@gmail.com
>> To: user@hadoop.apache.org
>>
>> Thanks for the reply.
>>
>> I am basically exploring possible ways to work with the hadoop framework
>> for one of my use cases. I have my limitations in using HDFS, but I agree
>> that using map reduce in conjunction with HDFS makes sense.
>>
>> I successfully tested WholeFileInputFormat after some googling.
>>
>> Now, coming to my use case. I would like to keep some files on my master
>> node and do some processing on the cloud nodes. The policy does not allow
>> us to configure and use the cloud nodes as HDFS nodes. However, I would
>> like to spawn a map process on those nodes. Hence, I set the input path
>> on the local file system, for example $HOME/inputs. I have a file listing
>> filenames (10 lines) in this input directory. I use NLineInputFormat and
>> spawn 10 map processes. Each map process gets a line. The map process
>> will then do a file transfer and process it. However, I get a
>> FileNotFoundException for $HOME/inputs in the map. I am sure this
>> directory is present on my master but not on the slave nodes. When I copy
>> this input directory to the slave nodes, it works fine. I am not able to
>> figure out how to fix this, or the reason for the error. I do not
>> understand why it complains that the input directory is not present. As
>> far as I know, a slave node gets a map task and the map method receives
>> the contents of the input file; that should be enough for the map logic
>> to work.
>>
>>
>> with regards
>> rabmdu
>>
>>
>> On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 wrote:
>>
>> If you don't plan to use HDFS, what kind of shared file system are you
>> going to use across the cluster? NFS?
>> For what you want to do, even though it doesn't make too much sense, you
>> first need to solve that problem: the shared file system.
>>
>> Second, if you want to process the files file by file, instead of block
>> by block as in HDFS, then you need to use a WholeFileInputFormat (google
>> how to write one). Then you don't need a file listing all the files to be
>> processed; just put them into one folder on the shared file system and
>> pass this folder to your MR job. In this way, as long as each node can
>> access it through some file system URL, each file will be processed by
>> its own mapper.
>>
>> Yong
>>
>> ________________________________
>> Date: Wed, 21 Aug 2013 17:39:10 +0530
>> Subject: running map tasks in remote node
>> From: rabmdu@gmail.com
>> To: user@hadoop.apache.org
>>
>>
>> Hello,
>>
>> Here is the newbie question of the day.
>>
>> For one of my use cases, I want to use hadoop map reduce without HDFS.
>> Here, I will have a text file containing a list of file names to process.
>> Assume that I have 10 lines (10 files to process) in the input text file
>> and I wish to generate 10 map tasks and execute them in parallel on 10
>> nodes.
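
For reference, here is a minimal sketch of the kind of driver described
above (and of Yong's file:/// advice in steps 3 and 4 earlier in the
thread), written against the newer org.apache.hadoop.mapreduce API. The
listing file name, the output path and the transfer step inside map() are
illustrative assumptions, not code from the thread; the key constraint is
that whatever path is given to the job must resolve to the same content on
every task node.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FileListDriver {

      // Each map() call receives one line of the listing, i.e. one file
      // name. The mapper is then free to fetch that file itself (scp,
      // HTTP, ...) and process it locally; nothing here assumes a shared
      // store beyond the small listing file itself.
      public static class FetchAndProcessMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text fileName, Context context)
            throws IOException, InterruptedException {
          // ... transfer fileName from the enterprise node and process it ...
          context.write(fileName, NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "process-file-list");
        job.setJarByClass(FileListDriver.class);
        job.setMapperClass(FetchAndProcessMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1); // one listing line -> one map task

        // Fully qualified URLs, as suggested above: they must resolve to
        // the same content on every task node (NFS mount, HDFS, ...).
        FileInputFormat.addInputPath(job,
            new Path("file:///share_data/inputs/filelist.txt"));
        FileOutputFormat.setOutputPath(job,
            new Path("file:///share_data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

If the listing lives only on the submitting machine's local disk, the task
nodes cannot open it, which matches the FileNotFoundException described
above.
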
>> I started with the basic tutorial on hadoop, could set up a single-node
>> hadoop cluster, and successfully tested the wordcount code.
>>
>> Now, I took two machines, A (master) and B (slave). I did the below
>> configuration on these machines to set up a two-node cluster.
>>
>> hdfs-site.xml
>>
>> <configuration>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>1</value>
>>   </property>
>>   <property>
>>     <name>dfs.name.dir</name>
>>     <value>/tmp/hadoop-bala/dfs/name</value>
>>   </property>
>>   <property>
>>     <name>dfs.data.dir</name>
>>     <value>/tmp/hadoop-bala/dfs/data</value>
>>   </property>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>A:9001</value>
>>   </property>
>> </configuration>
>>
>> mapred-site.xml
>>
>> <configuration>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>A:9001</value>
>>   </property>
>>   <property>
>>     <name>mapreduce.tasktracker.map.tasks.maximum</name>
>>     <value>1</value>
>>   </property>
>> </configuration>
>>
>> core-site.xml
>>
>> <configuration>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>hdfs://A:9000</value>
>>   </property>
>> </configuration>
>>
>> On A and B, I have a file named 'slaves' with an entry 'B' in it, and
>> another file called 'masters' with an entry 'A' in it.
>>
>> I have kept my input file on A. I see the map method process the input
>> file line by line, but all of it is processed on A. Ideally, I would
>> expect that processing to take place on B.
>>
>> Can anyone highlight where I am going wrong?
>>
>> regards
>> rab
>>
>>

--
Harsh J
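
The thread refers several times to a WholeFileInputFormat without showing
one. A common way to write it (sketched here against the newer
org.apache.hadoop.mapreduce API; this is illustrative, not code from the
thread) is to mark each input file as non-splittable and have the record
reader hand the whole file to the mapper as a single BytesWritable record:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // One whole file per record, and therefore one whole file per mapper.
    public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false;                 // never split a file across mappers
      }

      @Override
      public RecordReader<NullWritable, BytesWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
      }

      static class WholeFileRecordReader
          extends RecordReader<NullWritable, BytesWritable> {

        private FileSplit fileSplit;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
          this.fileSplit = (FileSplit) split;
          this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
          if (processed) {
            return false;
          }
          // Read the entire file backing this split into one value.
          byte[] contents = new byte[(int) fileSplit.getLength()];
          Path file = fileSplit.getPath();
          FileSystem fs = file.getFileSystem(conf);
          FSDataInputStream in = null;
          try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
          } finally {
            IOUtils.closeStream(in);
          }
          processed = true;
          return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
      }
    }

Setting this class with job.setInputFormatClass(WholeFileInputFormat.class)
and pointing the job at a folder (as Yong suggests) gives one mapper per
file, whether the folder lives on HDFS or on a file:/// path that every
node mounts identically.
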