hadoop-common-user mailing list archives

From Silvina Caíno Lores <silvi.ca...@gmail.com>
Subject Re: Job stuck in running state on Hadoop 2.2.0
Date Wed, 11 Dec 2013 09:59:09 GMT
I checked the yarn-site.xml configuration and, following Adam's advice, I tried
running the program without the memory settings I had found somewhere and
assumed would work (yarn.nodemanager.resource.memory-mb=2200 and
yarn.scheduler.minimum-allocation-mb=500), and the example worked
beautifully :D Thanks a lot, Adam, for your suggestion!
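For anyone who finds this in the archives: a minimal yarn-site.xml sketch of the two properties mentioned above, with the values from this thread (they are what was removed here, not recommendations; tune them to your nodes):

```xml
<!-- yarn-site.xml (sketch; values are the ones quoted in this thread) -->
<configuration>
  <property>
    <!-- Total memory the NodeManager may hand out to containers on this node -->
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>2200</value>
  </property>
  <property>
    <!-- Smallest container the scheduler allocates; requests round up to a
         multiple of this value -->
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>500</value>
  </property>
</configuration>
```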

To prevent future disasters, could you recommend a configuration guide or
give some hints on proper resource management?

Thank you once more!



On 11 December 2013 10:32, Silvina Caíno Lores <silvi.caino@gmail.com> wrote:

> OK, that was indeed a classpath issue, which I solved by directly exporting
> the output of hadoop classpath (i.e. the list of needed jars; see this
> <http://doc.mapr.com/display/MapR/hadoop+classpath>) into HADOOP_CLASSPATH
> in hadoop-env.sh and yarn-env.sh.
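A sketch of the workaround described above, as lines for hadoop-env.sh and yarn-env.sh (paths and placement are illustrative):

```shell
# hadoop-env.sh / yarn-env.sh -- sketch of the workaround described above.
# `hadoop classpath` prints the colon-separated list of directories and jars
# the launcher scripts compute; exporting its output makes that list explicit
# for container launches.
export HADOOP_CLASSPATH=$(hadoop classpath)
```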
>
> With this fixed, the stuck issue came back, so I will study Adam's
> suggestion.
>
>
On 11 December 2013 10:01, Silvina Caíno Lores <silvi.caino@gmail.com> wrote:
>
>> Actually now it seems to be running (or at least attempting to run) but I
>> get further errors:
>>
>> hadoop jar
>> ~/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar
>> pi 1 100
>>
>> INFO mapreduce.Job: Job job_1386751964857_0001 failed with state FAILED
>> due to: Application application_1386751964857_0001 failed 2 times due to AM
>> Container for appattempt_1386751964857_0001_000002 exited with exitCode: 1
>> due to: Exception from container-launch:
>> org.apache.hadoop.util.Shell$ExitCodeException:
>> at org.apache.hadoop.util.Shell.runCommand(Shell.java:504)
>> at org.apache.hadoop.util.Shell.run(Shell.java:417)
>> at
>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:636)
>> at
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>> at
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
>> at
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:724)
>>
>>
>>
>> I guess it's some sort of classpath issue, judging by this log:
>>
>> /scratch/HDFS-scaino-2/logs/application_1386751964857_0001/container_1386751964857_0001_01_000001$
>> cat stderr
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/hadoop/service/CompositeService
>> at java.lang.ClassLoader.defineClass1(Native Method)
>> at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
>> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>> at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.hadoop.service.CompositeService
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 13 more
>>
>>
>> I haven't found a solution yet, even though the classpath looks fine:
>>
>> hadoop classpath
>>
>>
>> /home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/etc/hadoop:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/common/lib/*:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/common/*:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/hdfs:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/hdfs/lib/*:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/hdfs/*:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/yarn/lib/*:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/yarn/*:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/lib/*:/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
>>
>>
>> Could that be related to the previous launch errors??
>>
>> Thanks in advance :)
>>
>>
>>
>>
>> On 11 December 2013 00:29, Adam Kawa <kawa.adam@gmail.com> wrote:
>>
>>> It sounds like the job was successfully submitted to the cluster, but
>>> there was some problem when starting/running the AM, so no progress is
>>> made. It happened to me once, when I was playing with YARN on a cluster
>>> of very small machines and misconfigured YARN to allocate more memory to
>>> the AM than was actually available on any machine in my cluster. As a
>>> result, the RM was not able to start the AM anywhere because it could
>>> not find a big enough container.
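The misconfiguration described above can be checked against two properties; a sketch with illustrative values (the property names are the Hadoop 2.x ones): the AM's memory request must fit inside what a single NodeManager offers, or the RM can never place the AM and the job sits at 0%.

```xml
<!-- mapred-site.xml: memory the MapReduce AM container asks for -->
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value>
</property>

<!-- yarn-site.xml: memory one NodeManager can offer; must be >= the AM
     request above, or no node can host the AM container -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2048</value>
</property>
```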
>>>
>>> Could you show the logs from the job? The link should be available on
>>> your console after you submit a job e.g.
>>> 13/12/10 10:41:21 INFO mapreduce.Job: The url to track the job:
>>> http://compute-7-2:8088/proxy/application_1386668372725_0001/
>>>
>>>
>>> 2013/12/10 Silvina Caíno Lores <silvi.caino@gmail.com>
>>>
>>>> Thank you! I realized that, although I exported the variables in the
>>>> scripts, there were a few errors and my desired configuration wasn't
>>>> being used (which explained other strange behavior).
>>>>
>>>> However, I'm still getting the same issue with the examples, for
>>>> instance:
>>>>
>>>> hadoop jar
>>>> ~/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar
>>>> pi 1 100
>>>> Number of Maps = 1
>>>> Samples per Map = 100
>>>> 13/12/10 10:41:18 WARN util.NativeCodeLoader: Unable to load
>>>> native-hadoop library for your platform... using builtin-java classes where
>>>> applicable
>>>> Wrote input for Map #0
>>>> Starting Job
>>>> 13/12/10 10:41:19 INFO client.RMProxy: Connecting to ResourceManager at
>>>> /0.0.0.0:8032
>>>> 13/12/10 10:41:20 INFO input.FileInputFormat: Total input paths to
>>>> process : 1
>>>> 13/12/10 10:41:20 INFO mapreduce.JobSubmitter: number of splits:1
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: user.name is
>>>> deprecated. Instead, use mapreduce.job.user.name
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapred.jar is
>>>> deprecated. Instead, use mapreduce.job.jar
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation:
>>>> mapred.map.tasks.speculative.execution is deprecated. Instead, use
>>>> mapreduce.map.speculative
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapred.reduce.tasks
>>>> is deprecated. Instead, use mapreduce.job.reduces
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation:
>>>> mapred.output.value.class is deprecated. Instead, use
>>>> mapreduce.job.output.value.class
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation:
>>>> mapred.reduce.tasks.speculative.execution is deprecated. Instead, use
>>>> mapreduce.reduce.speculative
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapreduce.map.class
>>>> is deprecated. Instead, use mapreduce.job.map.class
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapred.job.name is
>>>> deprecated. Instead, use mapreduce.job.name
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation:
>>>> mapreduce.reduce.class is deprecated. Instead, use
>>>> mapreduce.job.reduce.class
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation:
>>>> mapreduce.inputformat.class is deprecated. Instead, use
>>>> mapreduce.job.inputformat.class
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapred.input.dir is
>>>> deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapred.output.dir is
>>>> deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation:
>>>> mapreduce.outputformat.class is deprecated. Instead, use
>>>> mapreduce.job.outputformat.class
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapred.map.tasks is
>>>> deprecated. Instead, use mapreduce.job.maps
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation:
>>>> mapred.output.key.class is deprecated. Instead, use
>>>> mapreduce.job.output.key.class
>>>> 13/12/10 10:41:20 INFO Configuration.deprecation: mapred.working.dir is
>>>> deprecated. Instead, use mapreduce.job.working.dir
>>>> 13/12/10 10:41:20 INFO mapreduce.JobSubmitter: Submitting tokens for
>>>> job: job_1386668372725_0001
>>>> 13/12/10 10:41:20 INFO impl.YarnClientImpl: Submitted application
>>>> application_1386668372725_0001 to ResourceManager at /0.0.0.0:8032
>>>> 13/12/10 10:41:21 INFO mapreduce.Job: The url to track the job:
>>>> http://compute-7-2:8088/proxy/application_1386668372725_0001/
>>>> 13/12/10 10:41:21 INFO mapreduce.Job: Running job:
>>>> job_1386668372725_0001
>>>> 13/12/10 10:41:31 INFO mapreduce.Job: Job job_1386668372725_0001
>>>> running in uber mode : false
>>>> 13/12/10 10:41:31 INFO mapreduce.Job: map 0% reduce 0%
>>>> ---- stuck here ----
>>>>
>>>>
>>>> I hope the problem is not in the environment files. I have the
>>>> following at the beginning of hadoop-env.sh:
>>>>
>>>> # The java implementation to use.
>>>> export JAVA_HOME=/home/software/jdk1.7.0_25/
>>>>
>>>> # The jsvc implementation to use. Jsvc is required to run secure
>>>> datanodes.
>>>> #export JSVC_HOME=${JSVC_HOME}
>>>>
>>>> export
>>>> HADOOP_INSTALL=/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT
>>>>
>>>> export HADOOP_HDFS_HOME=$HADOOP_INSTALL
>>>> export HADOOP_COMMON_HOME=$HADOOP_INSTALL
>>>> export HADOOP_CONF_DIR=$HADOOP_INSTALL"/etc/hadoop"
>>>>
>>>>
>>>> and this in yarn-env.sh:
>>>>
>>>> export JAVA_HOME=/home/software/jdk1.7.0_25/
>>>>
>>>> export
>>>> HADOOP_INSTALL=/home/scaino/hadoop-2.2.0-maven/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT
>>>>
>>>> export HADOOP_HDFS_HOME=$HADOOP_INSTALL
>>>> export HADOOP_COMMON_HOME=$HADOOP_INSTALL
>>>> export HADOOP_CONF_DIR=$HADOOP_INSTALL"/etc/hadoop"
>>>>
>>>>
>>>> Not sure what to do about HADOOP_YARN_USER though, since I don't have a
>>>> dedicated user to run the daemons.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> On 10 December 2013 10:10, Taka Shinagawa <taka.epsilon@gmail.com> wrote:
>>>>
>>>>> I had a similar problem after setting up Hadoop 2.2.0 based on the
>>>>> instructions at
>>>>> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html
>>>>>
>>>>> Although it's not documented on the page, I needed to
>>>>> edit hadoop-env.sh and yarn-env.sh as well to update
>>>>> JAVA_HOME, HADOOP_CONF_DIR, HADOOP_YARN_USER and YARN_CONF_DIR.
>>>>>
>>>>> Once these variables were set, I was able to run the example
>>>>> successfully.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Dec 9, 2013 at 11:37 PM, Silvina Caíno Lores <
>>>>> silvi.caino@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I'm having trouble running the Hadoop examples on a single node. All
>>>>>> the executions get stuck in the running state at 0% map and reduce,
>>>>>> and the logs don't seem to indicate any issue, besides the need to
>>>>>> kill the node manager:
>>>>>>
>>>>>> compute-0-7-3: nodemanager did not stop gracefully after 5 seconds:
>>>>>> killing with kill -9
>>>>>>
>>>>>> RM
>>>>>>
>>>>>> 2013-12-09 11:52:22,466 INFO
>>>>>> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher:
>>>>>> Command to launch container container_1386585879247_0001_01_000001:
>>>>>> $JAVA_HOME/bin/java -Dlog4j.configuration=container-log4j.properties
>>>>>> -Dyarn.app.container.log.dir=<LOG_DIR> -Dyarn.app.container.log.filesize=0
>>>>>> -Dhadoop.root.logger=INFO,CLA -Xmx1024m
>>>>>> org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1><LOG_DIR>/stdout
>>>>>> 2><LOG_DIR>/stderr
>>>>>> 2013-12-09 11:52:22,882 INFO
>>>>>> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done
>>>>>> launching container Container: [ContainerId:
>>>>>> container_1386585879247_0001_01_000001, NodeId: compute-0-7-3:8010,
>>>>>> NodeHttpAddress: compute-0-7-3:8042, Resource: <memory:2000, vCores:1>,
>>>>>> Priority: 0, Token: Token { kind: ContainerToken, service:
>>>>>> 10.0.7.3:8010 }, ] for AM appattempt_1386585879247_0001_000001
>>>>>> 2013-12-09 11:52:22,883 INFO
>>>>>> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
>>>>>> appattempt_1386585879247_0001_000001 State change from ALLOCATED to
>>>>>> LAUNCHED
>>>>>> 2013-12-09 11:52:23,371 INFO
>>>>>> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
>>>>>> container_1386585879247_0001_01_000001 Container Transitioned from
>>>>>> ACQUIRED to RUNNING
>>>>>> 2013-12-09 11:52:30,922 INFO
>>>>>> SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for
>>>>>> appattempt_1386585879247_0001_000001 (auth:SIMPLE)
>>>>>> 2013-12-09 11:52:30,938 INFO
>>>>>> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM
>>>>>> registration appattempt_1386585879247_0001_000001
>>>>>> 2013-12-09 11:52:30,939 INFO
>>>>>> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=scaino
>>>>>> IP=10.0.7.3 OPERATION=Register App Master TARGET=ApplicationMasterService
>>>>>> RESULT=SUCCESS APPID=application_1386585879247_0001
>>>>>> APPATTEMPTID=appattempt_1386585879247_0001_000001
>>>>>> 2013-12-09 11:52:30,941 INFO
>>>>>> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
>>>>>> appattempt_1386585879247_0001_000001 State change from LAUNCHED to
>>>>>> RUNNING
>>>>>> 2013-12-09 11:52:30,941 INFO
>>>>>> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
>>>>>> application_1386585879247_0001 State change from ACCEPTED to RUNNING
>>>>>>
>>>>>>
>>>>>> NM
>>>>>>
>>>>>> 2013-12-10 08:26:02,100 INFO
>>>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got
>>>>>> event CONTAINER_STOP for appId application_1386585879247_0001
>>>>>> 2013-12-10 08:26:02,102 INFO
>>>>>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
>>>>>> Deleting absolute path :
>>>>>> /scratch/HDFS-scaino-2/tmp/nm-local-dir/usercache/scaino/appcache/application_1386585879247_0001
>>>>>> 2013-12-10 08:26:02,103 INFO
>>>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got
>>>>>> event APPLICATION_STOP for appId application_1386585879247_0001
>>>>>> 2013-12-10 08:26:02,110 INFO
>>>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>>>>>> Application application_1386585879247_0001 transitioned from
>>>>>> APPLICATION_RESOURCES_CLEANINGUP to FINISHED
>>>>>> 2013-12-10 08:26:02,157 INFO
>>>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler:
>>>>>> Scheduling Log Deletion for application: application_1386585879247_0001,
>>>>>> with delay of 10800 seconds
>>>>>> 2013-12-10 08:26:04,688 INFO
>>>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>>>>>> Stopping resource-monitoring for container_1386585879247_0001_01_000001
>>>>>> 2013-12-10 08:26:05,838 INFO
>>>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>>>>>> Done waiting for Applications to be Finished. Still alive:
>>>>>> [application_1386585879247_0001]
>>>>>> 2013-12-10 08:26:05,839 INFO org.apache.hadoop.ipc.Server: Stopping
>>>>>> server on 8010
>>>>>> 2013-12-10 08:26:05,846 INFO org.apache.hadoop.ipc.Server: Stopping
>>>>>> IPC Server listener on 8010
>>>>>> 2013-12-10 08:26:05,847 INFO org.apache.hadoop.ipc.Server: Stopping
>>>>>> IPC Server Responder
>>>>>>
>>>>>> I tried the pi and wordcount examples with the same results. Any
>>>>>> ideas on how to debug this?
>>>>>>
>>>>>> Thanks in advance.
>>>>>>
>>>>>> Regards,
>>>>>> Silvina Caíno
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
