hadoop-common-user mailing list archives

From Atish Kathpal <atish.kath...@gmail.com>
Subject Re: Running Hadoop v2 clustered mode MR on an NFS mounted filesystem
Date Wed, 08 Jan 2014 10:00:45 GMT
Hi

By giving the complete URI, the MR jobs worked across both nodes. Thanks a
lot for the advice.

*Two issues though*:
1. On completion of the MR job, I see only the "_SUCCESS" file in the
output directory, but no part-r file containing the actual results of the
wordcount job. However, I see the correct output when running MR over
HDFS. What is going wrong? Is there any place I can find logs for the MR
job? I see no errors on the console.
Command used:
hadoop jar
/home/hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
wordcount file:///home/hduser/testmount/ file:///home/hduser/testresults/
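(For anyone hitting the same wall: a couple of places worth checking for
the job logs, assuming YARN log aggregation and the default log
directories; the application id below is a made-up placeholder, the real
one is printed on the console at job submission:)

```shell
# Fetch the aggregated logs for a finished job; requires
# yarn.log-aggregation-enable=true in yarn-site.xml.
# (application_1389174214549_0001 is a placeholder id.)
yarn logs -applicationId application_1389174214549_0001

# Without log aggregation, the per-container stdout/stderr/syslog files
# sit under each NodeManager's local log directory, e.g.:
ls $HADOOP_HOME/logs/userlogs/
```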


2. I am observing that the mappers seem to access files sequentially:
splitting each file across mappers, reading its data in parallel,
then moving on to the next file. What I want instead is for the files
themselves to be accessed in parallel; that is, if there are 10 files
to be MRed, MR should request each of these files in parallel in one
go, and then work on the splits of these files in parallel.
*Why do I need this?* Some of the data coming from the NFS mount point is
coming from offline media (which takes ~5-10 seconds before the first
bytes are received). So I would like all required files to be requested
from the NFS mount point at the onset itself. This way several offline
media will be spun up in parallel, and as the data from these media
becomes available MR can process it.
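From what I understand of FileInputFormat, each file becomes one or more
splits and each split one map task, and the map tasks are scheduled
concurrently up to the cluster's container capacity. So one knob worth
trying is forcing one split per file, so that every file gets opened up
front by its own mapper. A sketch for mapred-site.xml (assuming Hadoop 2.x
property names; the value is only an example):

```xml
<!-- Sketch: raise the minimum split size above the largest input file so
     FileInputFormat emits one split (= one mapper, one open) per file.
     268435456 (256 MB) is an example value, not a recommendation. -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>268435456</value>
</property>
```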

Would be glad to get inputs on these points!

Thanks
Atish

Tip for those who are trying similar stuff:
In my case, after a while the jobs would fail, complaining of
"java.lang.OutOfMemoryError: Java heap space",
but I was able to rectify this with help from:
http://stackoverflow.com/questions/13674190/cdh-4-1-error-running-child-java-lang-outofmemoryerror-java-heap-space
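For reference, the fix from that thread boils down to giving the child
JVMs more heap. A mapred-site.xml sketch with example values (these are
illustrations only; tune them to your nodes' RAM):

```xml
<!-- Sketch: container memory and child JVM heap for map/reduce tasks.
     Keep the -Xmx value comfortably below the container memory limit. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>
</property>
```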





On Sun, Dec 22, 2013 at 2:47 PM, Atish Kathpal <atish.kathpal@gmail.com> wrote:

> Thanks Devin, Yong, and Chris for your replies and suggestions. I will
> test the suggestions made by Yong and Devin and get back to you guys.
>
> As for the bottlenecking issue, I agree, but I am trying to run a few MR
> jobs on a traditional NAS server. I can live with a few bottlenecks, so
> long as I don't have to move the data to a dedicated HDFS cluster.
>
>
> On Sat, Dec 21, 2013 at 8:06 AM, Chris Mawata <chris.mawata@gmail.com> wrote:
>
>>  Yong raises an important issue:  You have thrown out the I/O advantages
>> of HDFS and also thrown out the advantages of data locality. It would be
>> interesting to know why you are taking this approach.
>> Chris
>>
>>
>> On 12/20/2013 9:28 AM, java8964 wrote:
>>
>> I believe the "-fs local" should be removed too. The reason is that even
>> if you have a dedicated JobTracker after removing "-jt local", with "-fs
>> local" I believe all the mappers will still run sequentially.
>>
>>  "-fs local" forces MapReduce to run in "local" mode, which is really a
>> test mode.
>>
>>  What you can do is to remove both "-fs local -jt local", but give the
>> FULL URI of the input and output path, to tell Hadoop that they are local
>> filesystem instead of HDFS.
>>
>>  "hadoop jar
>> /hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
>> wordcount file:///hduser/mount_point file:///results"
>>
>>  Keep the following in mind:
>>
>>  1) The NFS mount needs to be available on all your Task Nodes, and
>> mounted in the same way.
>> 2) Even if you can do that, your shared storage will be your
>> bottleneck. NFS won't work well for scalability.
>>
>>  Yong
>>
>>  ------------------------------
>> Date: Fri, 20 Dec 2013 09:01:32 -0500
>> Subject: Re: Running Hadoop v2 clustered mode MR on an NFS mounted
>> filesystem
>> From: dsuiter@rdx.com
>> To: user@hadoop.apache.org
>>
>> I think most of your problem is coming from the options you are setting:
>>
>>  "hadoop jar
>> /hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
>> wordcount *-fs local -jt local* /hduser/mount_point/  /results"
>>
>>  You appear to be directing your namenode to run jobs in the *LOCAL* job
>> runner and directing it to read from the *LOCAL* filesystem. Drop the
>> *-jt* argument and it should run in distributed mode if your cluster is
>> set up right. You don't need to do anything special to point Hadoop towards
>> a NFS location, other than set up the NFS location properly and make sure
>> if you are directing to it by name that it will resolve to the right
>> address. Hadoop doesn't care where it is, as long as it can read from and
>> write to it. The fact that you are telling it to read/write from/to an NFS
>> location that happens to be mounted as a local filesystem object doesn't
>> matter. You could direct it to the local /hduser/ path and set the -fs
>> local option, and it would end up on the NFS mount, because that's where
>> the NFS mount actually exists; or you could direct it to the absolute
>> network location of the folder that you want. It shouldn't make a
>> difference.
>>
>>  *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>>   100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>>
>> On Fri, Dec 20, 2013 at 5:27 AM, Atish Kathpal <atish.kathpal@gmail.com> wrote:
>>
>> Hello
>>
>>  The picture below describes the deployment architecture I am trying to
>> achieve.
>> However, when I run the wordcount example code with the below
>> configuration, by issuing the command from the master node, I notice only
>> the master node spawning map tasks and completing the submitted job. Below
>> is the command I used:
>>
>>  *hadoop jar
>> /hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
>> wordcount -fs local -jt local /hduser/mount_point/  /results*
>>
>>  *Question: How can I leverage both the hadoop nodes for running MR,
>> while serving my data from the common NFS mount point running my filesystem
>> at the backend? Has any one tried such a setup before?*
>> [image: Inline image 1]
>>
>>  Thanks!
>>
>>
>>
>>
>
