hadoop-mapreduce-user mailing list archives

From Atish Kathpal <atish.kath...@gmail.com>
Subject Re: Running Hadoop v2 clustered mode MR on an NFS mounted filesystem
Date Wed, 08 Jan 2014 10:18:12 GMT
Figured out issue 1: the output of the reduce task was being written to
the slave node, while I was looking for it on the master node, which is
perfectly fine.
Still need guidance on issue 2, though!
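
For anyone hitting the same thing: with a file:// output path, each
part-r file lands on the local filesystem of whichever node ran that
reducer. A quick way to check every node (a sketch, assuming
passwordless ssh and the hypothetical hostnames "master" and "slave1"):

for h in master slave1; do
    echo "== $h =="
    ssh "$h" 'ls -l /home/hduser/testresults/'
done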

Thanks
Atish


On Wed, Jan 8, 2014 at 3:30 PM, Atish Kathpal <atish.kathpal@gmail.com> wrote:

> Hi
>
> By giving the complete URI, the MR jobs worked across both nodes. Thanks a
> lot for the advice.
>
> *Two issues though*:
> 1. On completion of the MR job, I see only the "_SUCCESS" file in the
> output directory, but no part-r file containing the actual results of
> the wordcount job. However, I see the correct output when running MR
> over HDFS. What is going wrong? Is there any place I can find logs for
> the MR job? I see no errors on the console.
> Command used:
> hadoop jar
> /home/hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
> wordcount file:///home/hduser/testmount/ file:///home/hduser/testresults/
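>
> (For reference on the logs question: if log aggregation is enabled
> (yarn.log-aggregation-enable=true), the logs of a finished job can be
> pulled with the command below; otherwise the per-container
> stdout/stderr/syslog files should be under the NodeManager's log
> directory on each node, e.g. $HADOOP_HOME/logs/userlogs/.)
>
> yarn logs -applicationId application_<id>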
>
>
> 2. I am observing that the mappers access the files sequentially: each
> file is split across mappers and its splits are read in parallel, but
> only then does processing move on to the next file. What I want instead
> is for the files themselves to be accessed in parallel; that is, if
> there are 10 files to be MRed, MR should request all 10 files in
> parallel in one go, and then work on the splits of these files in
> parallel.
> *Why do I need this?* Some of the data behind the NFS mount point comes
> from offline media, which take ~5-10 seconds before the first bytes are
> received. So I would like all required files to be requested from the
> NFS mount point at the onset. That way several offline media get spun
> up in parallel, and MR can process the data from each medium as it
> becomes available.
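>
> One workaround that comes to mind (just a sketch, not a MapReduce-level
> fix): before submitting the job, touch the first bytes of every input
> file in parallel from the shell, so all the offline media start spinning
> up at once and the data is online by the time the mappers reach it:
>
> for f in /home/hduser/testmount/*; do
>     dd if="$f" of=/dev/null bs=4k count=1 2>/dev/null &
> done
> wait
>
> But I would still prefer a way to get MR itself to issue the opens in
> parallel.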
>
> Would be glad to get inputs on these points!
>
> Thanks
> Atish
>
> Tip for those who are trying similar stuff:
> In my case, after a while the jobs would fail, complaining of
> "java.lang.OutOfMemoryError: Java heap space", but I was able to
> rectify this with help from:
> http://stackoverflow.com/questions/13674190/cdh-4-1-error-running-child-java-lang-outofmemoryerror-java-heap-space
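>
> Roughly, the fix amounts to raising the task JVM heap in
> mapred-site.xml (property names for Hadoop 2.x on YARN; the -Xmx values
> here are illustrative, tune them to your nodes):
>
> <property>
>     <name>mapreduce.map.java.opts</name>
>     <value>-Xmx1024m</value>
> </property>
> <property>
>     <name>mapreduce.reduce.java.opts</name>
>     <value>-Xmx2048m</value>
> </property>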
>
>
>
>
>
> On Sun, Dec 22, 2013 at 2:47 PM, Atish Kathpal <atish.kathpal@gmail.com> wrote:
>
>> Thanks Devin, Yong, and Chris for your replies and suggestions. I will
>> test the suggestions made by Yong and Devin and get back to you guys.
>>
>> As for the bottlenecking issue, I agree, but I am trying to run a few
>> MR jobs on a traditional NAS server. I can live with a few bottlenecks,
>> as long as I don't have to move the data to a dedicated HDFS cluster.
>>
>>
>> On Sat, Dec 21, 2013 at 8:06 AM, Chris Mawata <chris.mawata@gmail.com> wrote:
>>
>>>  Yong raises an important issue:  You have thrown out the I/O
>>> advantages of HDFS and also thrown out the advantages of data locality. It
>>> would be interesting to know why you are taking this approach.
>>> Chris
>>>
>>>
>>> On 12/20/2013 9:28 AM, java8964 wrote:
>>>
>>> I believe the "-fs local" should be removed too. The reason is that
>>> even if you have a dedicated JobTracker after removing "-jt local",
>>> with "-fs local" all the mappers will still run sequentially.
>>>
>>>  "-fs local" forces MapReduce to run in "local" mode, which is really
>>> a test mode.
>>>
>>>  What you can do is remove both "-fs local -jt local" and give the
>>> FULL URI of the input and output paths, to tell Hadoop that they are
>>> on the local filesystem instead of HDFS:
>>>
>>>  "hadoop jar
>>> /hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
>>> wordcount file:///hduser/mount_point file:///results"
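>>>
>>>  Alternatively (an untested sketch), you could make the local
>>> filesystem the default in core-site.xml, so plain paths resolve to
>>> file:// without spelling out the URI each time; note that this changes
>>> the default filesystem for everything on that node:
>>>
>>> <property>
>>>     <name>fs.defaultFS</name>
>>>     <value>file:///</value>
>>> </property>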
>>>
>>>  Keep the following in mind:
>>>
>>>  1) The NFS mount needs to be available on all your task nodes, and
>>> mounted in the same way (see the example below).
>>> 2) Even if you can do that, your shared storage will be your
>>> bottleneck. NFS won't work well for scalability.
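>>>
>>>  For point 1, something like the following on every task node (the
>>> server and export names are hypothetical) keeps the mount identical
>>> everywhere:
>>>
>>> mount -t nfs nas-server:/export/data /home/hduser/testmount
>>>
>>>  Or the equivalent /etc/fstab entry:
>>>
>>> nas-server:/export/data  /home/hduser/testmount  nfs  defaults  0  0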
>>>
>>>  Yong
>>>
>>>  ------------------------------
>>> Date: Fri, 20 Dec 2013 09:01:32 -0500
>>> Subject: Re: Running Hadoop v2 clustered mode MR on an NFS mounted
>>> filesystem
>>> From: dsuiter@rdx.com
>>> To: user@hadoop.apache.org
>>>
>>> I think most of your problem is coming from the options you are setting:
>>>
>>>  "hadoop jar
>>> /hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
>>> wordcount *-fs local -jt local* /hduser/mount_point/  /results"
>>>
>>>  You appear to be directing your namenode to run jobs in the *LOCAL* job
>>> runner and directing it to read from the *LOCAL* filesystem. Drop the
>>> *-jt* argument and it should run in distributed mode, if your cluster
>>> is set up right. You don't need to do anything special to point Hadoop
>>> towards an NFS location, other than set up the NFS location properly
>>> and, if you are addressing it by name, make sure the name resolves to
>>> the right address. Hadoop doesn't care where the data lives, as long
>>> as it can read from and write to it. The fact that you are telling it
>>> to read/write from/to an NFS location that happens to be mounted as a
>>> local filesystem object doesn't matter: you could point it at the
>>> local /hduser/ path with the -fs local option, and the output would
>>> still end up on the NFS mount, because that's where the mount actually
>>> lives; or you could point it at the absolute network location of the
>>> folder you want. It shouldn't make a difference.
>>>
>>>  *Devin Suiter*
>>> Jr. Data Solutions Software Engineer
>>>   100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>>
>>> On Fri, Dec 20, 2013 at 5:27 AM, Atish Kathpal <atish.kathpal@gmail.com> wrote:
>>>
>>> Hello
>>>
>>>  The picture below describes the deployment architecture I am trying
>>> to achieve.
>>> However, when I run the wordcount example with the configuration
>>> below, issuing the command from the master node, I notice that only
>>> the master node spawns map tasks and completes the submitted job.
>>> This is the command I used:
>>>
>>>  *hadoop jar
>>> /hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
>>> wordcount -fs local -jt local /hduser/mount_point/  /results*
>>>
>>>  *Question: How can I leverage both Hadoop nodes for running MR, while
>>> serving my data from the common NFS mount point, with my filesystem
>>> running at the backend? Has anyone tried such a setup before?*
>>> [image: deployment architecture diagram]
>>>
>>>  Thanks!
>>>
>>>
>>>
>>>
>>
>
