hadoop-common-user mailing list archives

From Adam Kawa <kawa.a...@gmail.com>
Subject Re: Job stuck in running state on Hadoop 2.2.0
Date Wed, 11 Dec 2013 19:24:32 GMT
I am glad that I could help.

In our case, we mostly followed the configuration from here:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
(changing it a bit to adapt to our requirements, e.g. today we run 2GB
containers instead of 3-4GB, but that might change in the future). Also make
sure that the memory allocated in mapreduce.map.java.opts is smaller than
mapreduce.map.memory.mb (and the same for reduce tasks).
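A quick way to sanity-check that relationship is a small shell snippet. The
values below are illustrative only (not taken from any real mapred-site.xml);
the point is that the -Xmx heap must leave headroom inside the container:

```shell
# Sketch: check that the JVM heap (-Xmx in mapreduce.map.java.opts) fits
# inside the container size (mapreduce.map.memory.mb). Example values only.
MAP_MEMORY_MB=2048         # mapreduce.map.memory.mb (container size)
MAP_JAVA_OPTS="-Xmx1638m"  # mapreduce.map.java.opts (JVM heap)

# Extract the -Xmx value in MB from the java opts string.
heap_mb=$(echo "$MAP_JAVA_OPTS" | sed -n 's/.*-Xmx\([0-9]*\)m.*/\1/p')

if [ "$heap_mb" -lt "$MAP_MEMORY_MB" ]; then
  echo "OK: heap ${heap_mb}m fits in ${MAP_MEMORY_MB}m container"
else
  echo "BAD: heap ${heap_mb}m does not fit in ${MAP_MEMORY_MB}m container"
fi
```

A common rule of thumb is to set the heap to roughly 80% of the container
size, as in the example above.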


2013/12/11 Silvina Caíno Lores <silvi.caino@gmail.com>

> I checked yarn-site.xml configuration and I tried to run the program
> without the memory configurations I found somewhere and assumed that would
> work (yarn.nodemanager.resource.memory-mb=2200 and
> yarn.scheduler.minimum-allocation-mb=500) following Adam's advice and the
> example worked beautifully :D Thanks a lot Adam for your suggestion!
>
> To prevent future disasters, could you recommend a configuration guide or
> give some hints on proper resource management?
>
> Thank you once more!
>
>
>
> On 11 December 2013 10:32, Silvina Caíno Lores <silvi.caino@gmail.com> wrote:
>
>> OK, that was indeed a classpath issue, which I solved by directly
>> exporting the output of hadoop classpath (i.e. the list of needed jars, see
>> this <http://doc.mapr.com/display/MapR/hadoop+classpath>) into
>> HADOOP_CLASSPATH in hadoop-env.sh and yarn-env.sh.
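The workaround described above can be sketched as follows. The fallback path
is a hypothetical placeholder for illustration, since `hadoop` may not be on
PATH in every environment:

```shell
# Sketch: resolve the full Hadoop classpath once and export it, as done in
# hadoop-env.sh / yarn-env.sh. Falls back to an illustrative placeholder
# when the hadoop command is unavailable.
if command -v hadoop >/dev/null 2>&1; then
  HADOOP_CLASSPATH=$(hadoop classpath)
else
  HADOOP_CLASSPATH='/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*'
fi
export HADOOP_CLASSPATH
echo "$HADOOP_CLASSPATH"
```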
>>
>> With this fixed, the stuck issue came back, so I will look into Adam's
>> suggestion.
>>
>>
>> On 11 December 2013 10:01, Silvina Caíno Lores <silvi.caino@gmail.com> wrote:
>>
>>> Actually, now it seems to be running (or at least attempting to run), but
>>> I get further errors:
>>>
>>> hadoop jar
>>> ~/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar
>>> pi 1 100
>>>
>>> INFO mapreduce.Job: Job job_1386751964857_0001 failed with state FAILED
>>> due to: Application application_1386751964857_0001 failed 2 times due to AM
>>> Container for appattempt_1386751964857_0001_000002 exited with exitCode: 1
>>> due to: Exception from container-launch:
>>> org.apache.hadoop.util.Shell$ExitCodeException:
>>> at org.apache.hadoop.util.Shell.runCommand(Shell.java:504)
>>> at org.apache.hadoop.util.Shell.run(Shell.java:417)
>>> at
>>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:636)
>>> at
>>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>>> at
>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
>>> at
>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:724)
>>>
>>>
>>>
>>> It looks like some sort of classpath issue, judging by this log:
>>>
>>> /scratch/HDFS-scaino-2/logs/application_1386751964857_0001/container_1386751964857_0001_01_000001$
>>> cat stderr
>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>> org/apache/hadoop/service/CompositeService
>>> at java.lang.ClassLoader.defineClass1(Native Method)
>>> at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
>>> at
>>> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>>> at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.hadoop.service.CompositeService
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> ... 13 more
>>>
>>>
>>> I haven't found a solution yet, even though the classpath looks fine:
>>>
>>> hadoop classpath
>>>
>>>
>>> /home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/etc/hadoop:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/common/lib/*:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/common/*:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/hdfs:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/hdfs/lib/*:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/hdfs/*:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/yarn/lib/*:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/yarn/*:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/lib/*:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
>>>
>>>
>>> Could that be related to the previous launch errors?
>>>
>>> Thanks in advance :)
>>>
>>>
>>>
>>>
>>> On 11 December 2013 00:29, Adam Kawa <kawa.adam@gmail.com> wrote:
>>>
>>>> It sounds like the job was successfully submitted to the cluster, but
>>>> there was some problem starting or running the AM, so no progress was
>>>> made. It happened to me once when I was playing with YARN on a cluster
>>>> of very small machines, and I misconfigured YARN to allocate more memory
>>>> to the AM than was actually available on any machine in my cluster. As a
>>>> result, the RM was not able to start the AM anywhere, because it could not
>>>> find a big enough container.
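That failure mode can be sketched with a simple comparison. The NodeManager
value below comes from earlier in this thread; the AM value is a hypothetical
misconfiguration for illustration:

```shell
# Sketch: if the AM container request exceeds what any NodeManager offers,
# the RM can never place the AM and the job sits at 0% forever.
NM_MEMORY_MB=2200    # yarn.nodemanager.resource.memory-mb (per-node capacity)
AM_RESOURCE_MB=3072  # yarn.app.mapreduce.am.resource.mb (too large here)

if [ "$AM_RESOURCE_MB" -gt "$NM_MEMORY_MB" ]; then
  echo "unschedulable: AM needs ${AM_RESOURCE_MB}MB but nodes offer ${NM_MEMORY_MB}MB"
else
  echo "schedulable"
fi
```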
>>>>
>>>> Could you show the logs from the job? The link should be available in
>>>> your console after you submit a job, e.g.
>>>> 13/12/10 10:41:21 INFO mapreduce.Job: The url to track the job:
>>>> http://compute-7-2:8088/proxy/application_1386668372725_0001/
>>>>
>>>>
>>>> 2013/12/10 Silvina Caíno Lores <silvi.caino@gmail.com>
>>>>
>>>>> Thank you! I realized that, although I exported the variables in the
>>>>> scripts, there were a few errors and my desired configuration wasn't being
>>>>> used (which explained other strange behavior).
>>>>>
>>>>> However, I'm still getting the same issue with the examples, for
>>>>> instance:
>>>>>
>>>>> hadoop jar
>>>>> ~/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar
>>>>> pi 1 100
>>>>> Number of Maps = 1
>>>>> Samples per Map = 100
>>>>> 13/12/10 10:41:18 WARN util.NativeCodeLoader: Unable to load
>>>>> native-hadoop library for your platform... using builtin-java classes where
>>>>> applicable
>>>>> Wrote input for Map #0
>>>>> Starting Job
>>>>> 13/12/10 10:41:19 INFO client.RMProxy: Connecting to ResourceManager
>>>>> at /0.0.0.0:8032
>>>>> 13/12/10 10:41:20 INFO input.FileInputFormat: Total input paths to
>>>>> process : 1
>>>>> 13/12/10 10:41:20 INFO mapreduce.JobSubmitter: number of splits:1
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: user.name is
>>>>> deprecated. Instead, use mapreduce.job.user.name
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapred.jar is
>>>>> deprecated. Instead, use mapreduce.job.jar
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation:
>>>>> mapred.map.tasks.speculative.execution is deprecated. Instead, use
>>>>> mapreduce.map.speculative
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapred.reduce.tasks
>>>>> is deprecated. Instead, use mapreduce.job.reduces
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation:
>>>>> mapred.output.value.class is deprecated. Instead, use
>>>>> mapreduce.job.output.value.class
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation:
>>>>> mapred.reduce.tasks.speculative.execution is deprecated. Instead, use
>>>>> mapreduce.reduce.speculative
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapreduce.map.class
>>>>> is deprecated. Instead, use mapreduce.job.map.class
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapred.job.name is
>>>>> deprecated. Instead, use mapreduce.job.name
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation:
>>>>> mapreduce.reduce.class is deprecated. Instead, use
>>>>> mapreduce.job.reduce.class
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation:
>>>>> mapreduce.inputformat.class is deprecated. Instead, use
>>>>> mapreduce.job.inputformat.class
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapred.input.dir is
>>>>> deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapred.output.dir is
>>>>> deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation:
>>>>> mapreduce.outputformat.class is deprecated. Instead, use
>>>>> mapreduce.job.outputformat.class
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapred.map.tasks is
>>>>> deprecated. Instead, use mapreduce.job.maps
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation:
>>>>> mapred.output.key.class is deprecated. Instead, use
>>>>> mapreduce.job.output.key.class
>>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapred.working.dir
>>>>> is deprecated. Instead, use mapreduce.job.working.dir
>>>>> 13/12/10 10:41:20 INFO mapreduce.JobSubmitter: Submitting tokens for
>>>>> job: job_1386668372725_0001
>>>>> 13/12/10 10:41:20 INFO impl.YarnClientImpl: Submitted application
>>>>> application_1386668372725_0001 to ResourceManager at /0.0.0.0:8032
>>>>> 13/12/10 10:41:21 INFO mapreduce.Job: The url to track the job:
>>>>> http://compute-7-2:8088/proxy/application_1386668372725_0001/
>>>>> 13/12/10 10:41:21 INFO mapreduce.Job: Running job:
>>>>> job_1386668372725_0001
>>>>> 13/12/10 10:41:31 INFO mapreduce.Job: Job job_1386668372725_0001
>>>>> running in uber mode : false
>>>>> 13/12/10 10:41:31 INFO mapreduce.Job: map 0% reduce 0%
>>>>> ---- stuck here ----
>>>>>
>>>>>
>>>>> I hope the problem is not in the environment files. I have the
>>>>> following at the beginning of hadoop-env.sh:
>>>>>
>>>>> # The java implementation to use.
>>>>> export JAVA_HOME=/home/software/jdk1.7.0_25/
>>>>>
>>>>> # The jsvc implementation to use. Jsvc is required to run secure
>>>>> datanodes.
>>>>> #export JSVC_HOME=${JSVC_HOME}
>>>>>
>>>>> export
>>>>> HADOOP_INSTALL=/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT
>>>>>
>>>>> export HADOOP_HDFS_HOME=$HADOOP_INSTALL
>>>>> export HADOOP_COMMON_HOME=$HADOOP_INSTALL
>>>>> export HADOOP_CONF_DIR=$HADOOP_INSTALL"/etc/hadoop"
>>>>>
>>>>>
>>>>> and this in yarn-env.sh:
>>>>>
>>>>> export JAVA_HOME=/home/software/jdk1.7.0_25/
>>>>>
>>>>> export
>>>>> HADOOP_INSTALL=/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT
>>>>>
>>>>> export HADOOP_HDFS_HOME=$HADOOP_INSTALL
>>>>> export HADOOP_COMMON_HOME=$HADOOP_INSTALL
>>>>> export HADOOP_CONF_DIR=$HADOOP_INSTALL"/etc/hadoop"
>>>>>
>>>>>
>>>>> Not sure what to do about HADOOP_YARN_USER though, since I don't have
>>>>> a dedicated user to run the daemons.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>> On 10 December 2013 10:10, Taka Shinagawa <taka.epsilon@gmail.com> wrote:
>>>>>
>>>>>> I had a similar problem after setting up Hadoop 2.2.0 based on the
>>>>>> instructions at
>>>>>> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html
>>>>>>
>>>>>> Although it's not documented on the page, I needed to
>>>>>> edit hadoop-env.sh and yarn-env.sh as well to update
>>>>>> JAVA_HOME, HADOOP_CONF_DIR, HADOOP_YARN_USER and YARN_CONF_DIR.
>>>>>>
>>>>>> Once these variables were set, I was able to run the example
>>>>>> successfully.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 9, 2013 at 11:37 PM, Silvina Caíno Lores <
>>>>>> silvi.caino@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I'm having trouble running the Hadoop examples on a single node. All
>>>>>>> the executions get stuck in the running state at 0% map and reduce, and
>>>>>>> the logs don't seem to indicate any issue, besides the need to kill the
>>>>>>> node manager:
>>>>>>>
>>>>>>> compute-0-7-3: nodemanager did not stop gracefully after 5 seconds:
>>>>>>> killing with kill -9
>>>>>>>
>>>>>>> RM
>>>>>>>
>>>>>>> 2013-12-09 11:52:22,466 INFO
>>>>>>> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher:
>>>>>>> Command to launch container container_1386585879247_0001_01_000001 :
>>>>>>> $JAVA_HOME/bin/java -Dlog4j.configuration=container-log4j.properties
>>>>>>> -Dyarn.app.container.log.dir=<LOG_DIR> -Dyarn.app.container.log.filesize=0
>>>>>>> -Dhadoop.root.logger=INFO,CLA -Xmx1024m
>>>>>>> org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1><LOG_DIR>/stdout
>>>>>>> 2><LOG_DIR>/stderr
>>>>>>> 2013-12-09 11:52:22,882 INFO
>>>>>>> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done
>>>>>>> launching container Container: [ContainerId:
>>>>>>> container_1386585879247_0001_01_000001, NodeId: compute-0-7-3:8010,
>>>>>>> NodeHttpAddress: compute-0-7-3:8042, Resource: <memory:2000, vCores:1>,
>>>>>>> Priority: 0, Token: Token { kind: ContainerToken, service:
>>>>>>> 10.0.7.3:8010 }, ] for AM appattempt_1386585879247_0001_000001
>>>>>>> 2013-12-09 11:52:22,883 INFO
>>>>>>> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
>>>>>>> appattempt_1386585879247_0001_000001 State change from ALLOCATED to LAUNCHED
>>>>>>> 2013-12-09 11:52:23,371 INFO
>>>>>>> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
>>>>>>> container_1386585879247_0001_01_000001 Container Transitioned from ACQUIRED
>>>>>>> to RUNNING
>>>>>>> 2013-12-09 11:52:30,922 INFO
>>>>>>> SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for
>>>>>>> appattempt_1386585879247_0001_000001 (auth:SIMPLE)
>>>>>>> 2013-12-09 11:52:30,938 INFO
>>>>>>> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM
>>>>>>> registration appattempt_1386585879247_0001_000001
>>>>>>> 2013-12-09 11:52:30,939 INFO
>>>>>>> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=scaino
>>>>>>> IP=10.0.7.3 OPERATION=Register App Master TARGET=ApplicationMasterService
>>>>>>> RESULT=SUCCESS APPID=application_1386585879247_0001
>>>>>>> APPATTEMPTID=appattempt_1386585879247_0001_000001
>>>>>>> 2013-12-09 11:52:30,941 INFO
>>>>>>> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
>>>>>>> appattempt_1386585879247_0001_000001 State change from LAUNCHED to RUNNING
>>>>>>> 2013-12-09 11:52:30,941 INFO
>>>>>>> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
>>>>>>> application_1386585879247_0001 State change from ACCEPTED to RUNNING
>>>>>>>
>>>>>>>
>>>>>>> NM
>>>>>>>
>>>>>>> 2013-12-10 08:26:02,100 INFO
>>>>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got
>>>>>>> event CONTAINER_STOP for appId application_1386585879247_0001
>>>>>>> 2013-12-10 08:26:02,102 INFO
>>>>>>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
>>>>>>> Deleting absolute path :
>>>>>>> /scratch/HDFS-scaino-2/tmp/nm-local-dir/usercache/scaino/appcache/application_1386585879247_0001
>>>>>>> 2013-12-10 08:26:02,103 INFO
>>>>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got
>>>>>>> event APPLICATION_STOP for appId application_1386585879247_0001
>>>>>>> 2013-12-10 08:26:02,110 INFO
>>>>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>>>>>>> Application application_1386585879247_0001 transitioned from
>>>>>>> APPLICATION_RESOURCES_CLEANINGUP to FINISHED
>>>>>>> 2013-12-10 08:26:02,157 INFO
>>>>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler:
>>>>>>> Scheduling Log Deletion for application: application_1386585879247_0001,
>>>>>>> with delay of 10800 seconds
>>>>>>> 2013-12-10 08:26:04,688 INFO
>>>>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>>>>>>> Stopping resource-monitoring for container_1386585879247_0001_01_000001
>>>>>>> 2013-12-10 08:26:05,838 INFO
>>>>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>>>>>>> Done waiting for Applications to be Finished. Still alive:
>>>>>>> [application_1386585879247_0001]
>>>>>>> 2013-12-10 08:26:05,839 INFO org.apache.hadoop.ipc.Server: Stopping
>>>>>>> server on 8010
>>>>>>> 2013-12-10 08:26:05,846 INFO org.apache.hadoop.ipc.Server: Stopping
>>>>>>> IPC Server listener on 8010
>>>>>>> 2013-12-10 08:26:05,847 INFO org.apache.hadoop.ipc.Server: Stopping
>>>>>>> IPC Server Responder
>>>>>>>
>>>>>>> I tried the pi and wordcount examples with the same results. Any ideas
>>>>>>> on how to debug this?
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Silvina Caíno
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
