hadoop-hdfs-user mailing list archives

From Ashish Jain <ashja...@gmail.com>
Subject Re: Distributing the code to multiple nodes
Date Thu, 16 Jan 2014 07:09:54 GMT
Voila!! It worked finally :). Thanks a lot for all the support from all the
folks in this forum. Here is a summary, for others, of what I finally did to
solve this:

1) Change the framework to yarn using mapreduce.framework.name in
mapred-site.xml.
2) In yarn-site.xml, add the following properties:
<name>yarn.nodemanager.resource.memory-mb</name>
<name>yarn.scheduler.minimum-allocation-mb</name>
3) In mapred-site.xml, add the following properties:
<name>mapreduce.map.memory.mb</name>
<name>mapreduce.reduce.memory.mb</name>
<name>mapreduce.map.java.opts</name>
<name>mapreduce.reduce.java.opts</name>
4) Use the capacity scheduler. I think the fair scheduler may also work, but I
used the capacity scheduler. (Example property blocks for steps 1 to 4 are
sketched below.)
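
For anyone following along, here is a minimal sketch of the property blocks for
steps 1 to 4. The memory values are illustrative only (they assume roughly 8 GB
of a node's RAM given to YARN containers) and are not taken from this thread, so
size them to your own hardware; the Java heap sizes should stay below the
corresponding container sizes. The scheduler property is shown explicitly even
though the CapacityScheduler is already the 2.2.0 default.

yarn-site.xml:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

mapred-site.xml:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx768m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1536m</value>
</property>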

Start the system and run the jobs; they will be distributed across all the
nodes. I could see 8 map tasks running because I had 8 data blocks, and all
the nodes were serving requests. However, I still see only 1 reduce task; I
will address that in a separate post.

--Ashish


On Wed, Jan 15, 2014 at 7:23 PM, sudhakara st <sudhakara.st@gmail.com>wrote:

> Hello Ashish
>
>
> WARN
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> Node : l1-DEV05:1004 does not have sufficient resource for request :
> {Priority: 0, Capability: <memory:2048, vCores:1>, # Containers: 1,
> Location: *, Relax Locality: true} node total capability : <memory:1024,
> vCores:8>
>
> The resource manager is trying to allocate 2 GB for the container, but only
> 1 GB is available on the node.
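>
> A minimal sketch of one way to fix that mismatch, assuming the node actually
> has more physical memory to give to YARN (the value below is illustrative,
> not from this thread); otherwise shrink the per-container request instead:
>
> <!-- yarn-site.xml: raise what the NodeManager advertises to the RM -->
> <property>
>   <name>yarn.nodemanager.resource.memory-mb</name>
>   <value>4096</value>
> </property>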
>
>
> On Wed, Jan 15, 2014 at 7:07 PM, Ashish Jain <ashjain2@gmail.com> wrote:
>
>> I tried that but somehow my map reduce jobs do not execute at all once I
>> set it to yarn
>>
>>
>> On Wed, Jan 15, 2014 at 7:00 PM, Nirmal Kumar <nirmal.kumar@impetus.co.in
>> > wrote:
>>
>>>  Surely you don’t have to set mapreduce.jobtracker.address in
>>> mapred-site.xml.
>>>
>>>
>>>
>>> In mapred-site.xml you just have to mention:
>>>
>>> <property>
>>>
>>> <name>mapreduce.framework.name</name>
>>>
>>> <value>yarn</value>
>>>
>>> </property>
>>>
>>>
>>>
>>> -Nirmal
>>>
>>> From: Ashish Jain [mailto:ashjain2@gmail.com]
>>> Sent: Wednesday, January 15, 2014 6:44 PM
>>>
>>> To: user@hadoop.apache.org
>>> Subject: Re: Distributing the code to multiple nodes
>>>
>>>
>>>
>>> I think this is the problem. I have not set
>>> "mapreduce.jobtracker.address" in my mapred-site.xml, and by default it is
>>> set to local. Now the question is how to point it to a remote host. The
>>> documentation says I need to specify the host:port of the job tracker for
>>> this. As we know, hadoop 2.2.0 is completely overhauled and there is no
>>> concept of a task tracker or job tracker; instead there is now a resource
>>> manager and a node manager. So in this case, what do I set as
>>> "mapreduce.jobtracker.address"? Do I set it to
>>> resourceManagerHost:resourceManagerPort?
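>>>
>>> (A side note with a sketch: once mapreduce.framework.name is yarn,
>>> mapreduce.jobtracker.address is not needed at all. The rough YARN-side
>>> equivalent is pointing clients at the ResourceManager in yarn-site.xml;
>>> the host name below is illustrative, only the default port 8032 is
>>> standard:
>>>
>>> <property>
>>>   <name>yarn.resourcemanager.address</name>
>>>   <value>resourcemanager-host:8032</value>
>>> </property>
>>> )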
>>>
>>> --Ashish
>>>
>>>
>>>
>>> On Wed, Jan 15, 2014 at 4:20 PM, Ashish Jain <ashjain2@gmail.com> wrote:
>>>
>>>  Hi Sudhakar,
>>>
>>> Indeed there was a typo; the complete command is as follows (minus the
>>> main class, since my manifest has an entry for the main class):
>>> ./hadoop jar wordCount.jar /opt/ApacheHadoop/temp/worker.log
>>> /opt/ApacheHadoop/out/
>>>
>>> Next I killed the datanode on 10.12.11.210 and I see the following
>>> messages in the log files. It looks like the namenode is still trying to
>>> assign the complete task to one single node, and since it does not find the
>>> complete data set on one node, it is complaining.
>>>
>>>
>>> 2014-01-15 16:38:26,894 WARN
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
>>> Node : l1-DEV05:1004 does not have sufficient resource for request :
>>> {Priority: 0, Capability: <memory:2048, vCores:1>, # Containers: 1,
>>> Location: *, Relax Locality: true} node total capability : <memory:1024,
>>> vCores:8>
>>> 2014-01-15 16:38:27,348 WARN
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
>>> Node : l1dev-211:1004 does not have sufficient resource for request :
>>> {Priority: 0, Capability: <memory:2048, vCores:1>, # Containers: 1,
>>> Location: *, Relax Locality: true} node total capability : <memory:1024,
>>> vCores:8>
>>> 2014-01-15 16:38:27,871 WARN
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
>>> Node : l1-dev06:1004 does not have sufficient resource for request :
>>> {Priority: 0, Capability: <memory:2048, vCores:1>, # Containers: 1,
>>> Location: *, Relax Locality: true} node total capability : <memory:1024,
>>> vCores:8>
>>> 2014-01-15 16:38:27,897 WARN
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
>>> Node : l1-DEV05:1004 does not have sufficient resource for request :
>>> {Priority: 0, Capability: <memory:2048, vCores:1>, # Containers: 1,
>>> Location: *, Relax Locality: true} node total capability : <memory:1024,
>>> vCores:8>
>>> 2014-01-15 16:38:28,349 WARN
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
>>> Node : l1dev-211:1004 does not have sufficient resource for request :
>>> {Priority: 0, Capability: <memory:2048, vCores:1>, # Containers: 1,
>>> Location: *, Relax Locality: true} node total capability : <memory:1024,
>>> vCores:8>
>>> 2014-01-15 16:38:28,874 WARN
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
>>> Node : l1-dev06:1004 does not have sufficient resource for request :
>>> {Priority: 0, Capability: <memory:2048, vCores:1>, # Containers: 1,
>>> Location: *, Relax Locality: true} node total capability : <memory:1024,
>>> vCores:8>
>>> 2014-01-15 16:38:28,900 WARN
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
>>> Node : l1-DEV05:1004 does not have sufficient resource for request :
>>> {Priority: 0, Capability: <memory:2048, vCores:1>, # Containers: 1,
>>> Location: *, Relax Locality: true} node total capability : <memory:1024,
>>> vCores:8>
>>>
>>>   --Ashish
>>>
>>>
>>>
>>> On Wed, Jan 15, 2014 at 3:59 PM, sudhakara st <sudhakara.st@gmail.com>
>>> wrote:
>>>
>>>   Hello Ashish
>>>
>>>
>>>
>>> 2) Run the example again using the command
>>> ./hadoop dfs wordCount.jar /opt/ApacheHadoop/temp/worker.log
>>> /opt/ApacheHadoop/out/
>>>
>>>   Unless it is a typo, the command should be
>>> ./hadoop jar wordCount.jar WordCount /opt/ApacheHadoop/temp/worker.log
>>> /opt/ApacheHadoop/out/
>>>
>>> One more thing to try: just stop the datanode process on 10.12.11.210 and
>>> run the job.
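>>>
>>> A sketch of how that could be done with the stock 2.2.0 daemon script
>>> (the $HADOOP_HOME path is an assumption, adjust to your install):
>>>
>>> # on 10.12.11.210, as the user that started the daemons
>>> $HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
>>> # and to bring it back afterwards
>>> $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode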
>>>
>>>
>>>
>>> On Wed, Jan 15, 2014 at 2:07 PM, Ashish Jain <ashjain2@gmail.com> wrote:
>>>
>>>     Hello Sudhakara,
>>>
>>> Thanks for your suggestion. However, once I change the mapreduce
>>> framework to yarn, my map reduce jobs do not get executed at all. It seems
>>> the job is waiting on some thread indefinitely. Here is what I have done:
>>>
>>> 1) Set the mapreduce framework to yarn in mapred-site.xml
>>> <property>
>>>  <name>mapreduce.framework.name</name>
>>>  <value>yarn</value>
>>> </property>
>>>
>>> 2) Run the example again using the command
>>>
>>> ./hadoop dfs wordCount.jar /opt/ApacheHadoop/temp/worker.log
>>> /opt/ApacheHadoop/out/
>>>
>>> The jobs are just stuck and do not move further.
>>>
>>>   I also tried the following, and it complains of a FileNotFoundException
>>> and some security exception:
>>>
>>> ./hadoop dfs wordCount.jar file:///opt/ApacheHadoop/temp/worker.log
>>> file:///opt/ApacheHadoop/out/
>>>
>>> Below is the status of the job from hadoop application console. The
>>> progress bar does not move at all.
>>>
>>>
>>>
>>> ID: application_1389771586883_0002
>>> (http://10.12.11.210:8088/cluster/app/application_1389771586883_0002)
>>> User: root
>>> Name: wordcount
>>> Application Type: MAPREDUCE
>>> Queue: default
>>> StartTime: Wed, 15 Jan 2014 07:52:04 GMT
>>> FinishTime: N/A
>>> State: ACCEPTED
>>> FinalStatus: UNDEFINED
>>> Progress: (progress bar, not moving)
>>> Tracking UI: UNASSIGNED
>>>
>>>
>>>
>>> Please advise what I should do.
>>>
>>> --Ashish
>>>
>>>
>>>
>>> On Tue, Jan 14, 2014 at 5:48 PM, sudhakara st <sudhakara.st@gmail.com>
>>> wrote:
>>>
>>>   Hello Ashish
>>>
>>> It seems the job is running in the local job runner (LocalJobRunner),
>>> reading the local file system. Can you try giving the full URI paths of
>>> the input and output?
>>>
>>> like
>>>
>>> $hadoop jar program.jar   ProgramName -Dmapreduce.framework.name=yarn
>>> file:///home/input/  file:///home/output/
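>>>
>>> Note that -D generic options like the above are only picked up if the
>>> driver goes through ToolRunner/GenericOptionsParser. A minimal sketch of
>>> such a driver, in case yours does not (the class name is illustrative;
>>> mapper/reducer setup is omitted for brevity):
>>>
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.conf.Configured;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.mapreduce.Job;
>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>> import org.apache.hadoop.util.Tool;
>>> import org.apache.hadoop.util.ToolRunner;
>>>
>>> public class WordCountDriver extends Configured implements Tool {
>>>   @Override
>>>   public int run(String[] args) throws Exception {
>>>     // getConf() already contains any -D options parsed by ToolRunner
>>>     Job job = Job.getInstance(getConf(), "wordcount");
>>>     job.setJarByClass(WordCountDriver.class);
>>>     // set mapper, reducer and output key/value classes here as usual
>>>     FileInputFormat.addInputPath(job, new Path(args[0]));
>>>     FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>     return job.waitForCompletion(true) ? 0 : 1;
>>>   }
>>>
>>>   public static void main(String[] args) throws Exception {
>>>     System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
>>>   }
>>> }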
>>>
>>>
>>>
>>> On Mon, Jan 13, 2014 at 3:02 PM, Ashish Jain <ashjain2@gmail.com> wrote:
>>>
>>>   German,
>>>
>>> This does not seem to be helping. I tried to use the FairScheduler, but
>>> the behavior remains the same. I could see the fair scheduler log getting a
>>> continuous heartbeat from both of the other nodes, but it is still not
>>> distributing the work to the other nodes. What I did next was start 3 jobs
>>> simultaneously, so that maybe some part of one of the jobs would be
>>> distributed to the other nodes. However, still only one node is being used
>>> :(((. What is it that is going wrong? Can someone help?
>>>
>>> Sample of the fair scheduler log:
>>> 2014-01-13 15:13:54,293 HEARTBEAT       l1dev-211
>>> 2014-01-13 15:13:54,953 HEARTBEAT       l1-dev06
>>> 2014-01-13 15:13:54,988 HEARTBEAT       l1-DEV05
>>> 2014-01-13 15:13:55,295 HEARTBEAT       l1dev-211
>>> 2014-01-13 15:13:55,956 HEARTBEAT       l1-dev06
>>> 2014-01-13 15:13:55,993 HEARTBEAT       l1-DEV05
>>> 2014-01-13 15:13:56,297 HEARTBEAT       l1dev-211
>>> 2014-01-13 15:13:56,960 HEARTBEAT       l1-dev06
>>> 2014-01-13 15:13:56,997 HEARTBEAT       l1-DEV05
>>> 2014-01-13 15:13:57,299 HEARTBEAT       l1dev-211
>>> 2014-01-13 15:13:57,964 HEARTBEAT       l1-dev06
>>> 2014-01-13 15:13:58,001 HEARTBEAT       l1-DEV05
>>>
>>> My data is distributed as blocks to the other nodes. The host with IP
>>> 10.12.11.210 has all the data, and this is the one which is serving all
>>> the requests.
>>>
>>> Total number of blocks: 8
>>> 1073741866:         10.12.11.211:50010    View Block Info
>>> 10.12.11.210:50010    View Block Info
>>> 1073741867:         10.12.11.211:50010    View Block Info
>>> 10.12.11.210:50010    View Block Info
>>> 1073741868:         10.12.11.210:50010    View Block Info
>>> 10.12.11.209:50010    View Block Info
>>> 1073741869:         10.12.11.210:50010    View Block Info
>>> 10.12.11.209:50010    View Block Info
>>> 1073741870:         10.12.11.211:50010    View Block Info
>>> 10.12.11.210:50010    View Block Info
>>> 1073741871:         10.12.11.210:50010    View Block Info
>>> 10.12.11.209:50010    View Block Info
>>> 1073741872:         10.12.11.211:50010    View Block Info
>>> 10.12.11.210:50010    View Block Info
>>> 1073741873:         10.12.11.210:50010    View Block Info
>>> 10.12.11.209:50010    View Block Info
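>>>
>>> (For anyone reading later: block placement like the above can also be
>>> listed from the command line; a sketch, assuming the input really sits in
>>> HDFS under /opt/ApacheHadoop/temp/worker.log:
>>>
>>> ./hdfs fsck /opt/ApacheHadoop/temp/worker.log -files -blocks -locations
>>> )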
>>>
>>>
>>>
>>> Someone please advise on how to go about this.
>>>
>>> --Ashish
>>>
>>>
>>>
>>> On Fri, Jan 10, 2014 at 12:58 PM, Ashish Jain <ashjain2@gmail.com>
>>> wrote:
>>>
>>>  Thanks for all these suggestions. Somehow I do not have access to the
>>> servers today; I will try the suggestions on Monday and will let you
>>> know how it goes.
>>>
>>> --Ashish
>>>
>>>
>>>
>>> On Thu, Jan 9, 2014 at 7:53 PM, German Florez-Larrahondo <
>>> german.fl@samsung.com> wrote:
>>>
>>>  Ashish
>>>
>>> Could this be related to the scheduler you are using and its settings?
>>>
>>>
>>>
>>> In lab environments, when running a single type of job, I often use the
>>> FairScheduler (the YARN default in 2.2.0 is the CapacityScheduler), and it
>>> does a good job distributing the load.
>>>
>>>
>>>
>>> You could give that a try (
>>> https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
>>> )
>>>
>>>
>>>
>>> I think just changing yarn-site.xml as follows could demonstrate this
>>> theory (note that how the jobs are scheduled depends on resources such as
>>> memory on the nodes, and you would need to set up yarn-site.xml accordingly).
>>>
>>>
>>>
>>> <property>
>>>
>>>   <name>yarn.resourcemanager.scheduler.class</name>
>>>
>>>
>>> <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
>>>
>>> </property>
>>>
>>>
>>>
>>> Regards
>>>
>>> ./g
>>>
>>>
>>>
>>>
>>>
>>> From: Ashish Jain [mailto:ashjain2@gmail.com]
>>> Sent: Thursday, January 09, 2014 6:46 AM
>>> To: user@hadoop.apache.org
>>> Subject: Re: Distributing the code to multiple nodes
>>>
>>>
>>>
>>> Another point to add here: 10.12.11.210 is the host which has everything
>>> running, including a slave datanode. Data was also distributed to this
>>> host, as was the jar file. The following processes are running on
>>> 10.12.11.210:
>>>
>>> 7966 DataNode
>>> 8480 NodeManager
>>> 8353 ResourceManager
>>> 8141 SecondaryNameNode
>>> 7834 NameNode
>>>
>>>
>>>
>>> On Thu, Jan 9, 2014 at 6:12 PM, Ashish Jain <ashjain2@gmail.com> wrote:
>>>
>>> Logs were updated only when I copied the data. After copying the data
>>> there have been no updates to the log files.
>>>
>>>
>>>
>>> On Thu, Jan 9, 2014 at 5:08 PM, Chris Mawata <chris.mawata@gmail.com>
>>> wrote:
>>>
>>> Do the logs on the three nodes contain anything interesting?
>>> Chris
>>>
>>> On Jan 9, 2014 3:47 AM, "Ashish Jain" <ashjain2@gmail.com> wrote:
>>>
>>> Here is the block info for the file I distributed. As can be seen, only
>>> 10.12.11.210 has all the data, and this is the node which is serving all
>>> the requests. Replicas are available on 209 as well as 210.
>>>
>>> 1073741857:         10.12.11.210:50010    View Block Info
>>> 10.12.11.209:50010    View Block Info
>>> 1073741858:         10.12.11.210:50010    View Block Info
>>> 10.12.11.211:50010    View Block Info
>>> 1073741859:         10.12.11.210:50010    View Block Info
>>> 10.12.11.209:50010    View Block Info
>>> 1073741860:         10.12.11.210:50010    View Block Info
>>> 10.12.11.211:50010    View Block Info
>>> 1073741861:         10.12.11.210:50010    View Block Info
>>> 10.12.11.209:50010    View Block Info
>>> 1073741862:         10.12.11.210:50010    View Block Info
>>> 10.12.11.209:50010    View Block Info
>>> 1073741863:         10.12.11.210:50010    View Block Info
>>> 10.12.11.209:50010    View Block Info
>>> 1073741864:         10.12.11.210:50010    View Block Info
>>> 10.12.11.209:50010    View Block Info
>>>
>>> --Ashish
>>>
>>>
>>>
>>> On Thu, Jan 9, 2014 at 2:11 PM, Ashish Jain <ashjain2@gmail.com> wrote:
>>>
>>> Hello Chris,
>>>
>>> I now have a cluster with 3 nodes and a replication factor of 2. When I
>>> distribute a file, I can see that there are replicas of the data available
>>> on the other nodes. However, when I run a map reduce job, again only one
>>> node is serving all the requests :(. Can you or anyone please provide some
>>> more input?
>>>
>>> Thanks
>>> Ashish
>>>
>>>
>>>
>>> On Wed, Jan 8, 2014 at 7:16 PM, Chris Mawata <chris.mawata@gmail.com>
>>> wrote:
>>>
>>> 2 nodes and replication factor of 2 results in a replica of each block
>>> present on each node. This would allow the possibility that a single node
>>> would do the work and yet be data local.  It will probably happen if that
>>> single node has the needed capacity.  More nodes than the replication
>>> factor are needed to force distribution of the processing.
>>> Chris
>>>
>>> On Jan 8, 2014 7:35 AM, "Ashish Jain" <ashjain2@gmail.com> wrote:
>>>
>>> Guys,
>>>
>>> I am sure that only one node is being used. I just now ran the job
>>> again and could see the CPU usage going high on only one server; the other
>>> servers' CPU usage remains constant, which means the other nodes are not
>>> being used. Can someone help me debug this issue?
>>>
>>> ++Ashish
>>>
>>>
>>>
>>> On Wed, Jan 8, 2014 at 5:04 PM, Ashish Jain <ashjain2@gmail.com> wrote:
>>>
>>> Hello All,
>>>
>>> I have a 2 node hadoop cluster running with a replication factor of 2. I
>>> have a file of size around 1 GB which, when copied to HDFS, is replicated
>>> to both the nodes. Looking at the block info, I can see the file has been
>>> subdivided into 8 parts, which means it has been split into 8 blocks,
>>> each of size 128 MB. I use this file as input to run the word count
>>> program. Somehow I feel only one node is doing all the work and the code
>>> is not distributed to the other node. How can I make sure the code is
>>> distributed to both the nodes? Also, is there a log or GUI which can be
>>> used for this?
>>>
>>> Please note I am using the latest stable release that is 2.2.0.
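>>>
>>> (On the log/GUI question, assuming default ports: the ResourceManager web
>>> UI at http://<resourcemanager-host>:8088/cluster lists applications and
>>> how many containers each node is running, and the NameNode UI at
>>> http://<namenode-host>:50070 shows block and replica placement.)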
>>>
>>> ++Ashish
>>>
>>>
>>>   --
>>>
>>>
>>> Regards,
>>> ...Sudhakara.st
>>>
>>>
>>>
>>>
>>>
>>>
>>>   --
>>>
>>>
>>> Regards,
>>> ...Sudhakara.st
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
> --
>
> Regards,
> ...Sudhakara.st
>
>
