hadoop-common-user mailing list archives

From Stephen Sprague <sprag...@gmail.com>
Subject Re: YARN ResourceManger running with yarn.root.logger=DEBUG,console
Date Wed, 11 Jan 2017 22:59:23 GMT
ok.  i would attach but... i think there might be an aversion to
attachments so i'll paste inline.  hopefully it's not too confusing.

$ cat fair-scheduler.xml

<?xml version="1.0"?>

<!--
  This is a sample configuration file for the Fair Scheduler. For details
  on the options, please refer to the fair scheduler documentation at
  http://hadoop.apache.org/core/docs/r0.21.0/fair_scheduler.html.

  To create your own configuration, copy this file to conf/fair-scheduler.xml
  and add the following property in mapred-site.xml to point Hadoop to the
  file, replacing [HADOOP_HOME] with the path to your installation directory:
    <property>
      <name>mapred.fairscheduler.allocation.file</name>
      <value>[HADOOP_HOME]/conf/fair-scheduler.xml</value>
    </property>

  Note that all the parameters in the configuration file below are optional,
  including the parameters inside <pool> and <user> elements. It is only
  necessary to set the ones you want to differ from the defaults.
-->

<!-- https://hadoop.apache.org/docs/r1.2.1/fair_scheduler.html -->

<allocations>

  <!-- NOTE. ** Preemption IS NOT turned on! ** -->

  <!-- Preemption timeout for jobs below their fair share, in seconds.
    If a job is below half its fair share for this amount of time, it
    is allowed to kill tasks from other jobs to go up to its fair share.
    Requires mapred.fairscheduler.preemption to be true in mapred-site.xml.
-->
  <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>

  <!-- Default min share preemption timeout for pools where it is not
    explicitly configured, in seconds. Requires mapred.fairscheduler.preemption
    to be set to true in your mapred-site.xml. -->
  <defaultMinSharePreemptionTimeout>600</defaultMinSharePreemptionTimeout>

  <!-- Default running job limit pools where it is not explicitly set. -->
  <queueMaxJobsDefault>20</queueMaxJobsDefault>

  <!-- Default running job limit users where it is not explicitly set. -->
  <userMaxJobsDefault>10</userMaxJobsDefault>


<!--  QUEUES:
         dwr.interactive   : 10 at once
         dwr.batch_sql     : 15 at once
         dwr.batch_hdfs    : 5 at once   (distcp, sqoop, hfs -put, anything besides 'sql')
         dwr.qa            : 3 at once
         dwr.truck_lane    : 1 at once

         cad.interactive   : 5 at once
         cad.batch         : 10 at once

         comms.interactive : 5 at once
         comms.batch       : 3 at once

         default           : 2 at once   (to discourage its use)
-->


<!-- queue placement -->

  <queuePlacementPolicy>
    <rule name="specified" />
    <rule name="default" />
  </queuePlacementPolicy>


<!-- footprint -->
 <queue name='footprint'>
    <schedulingPolicy>fair</schedulingPolicy>   <!-- can be fifo too -->

    <maxRunningApps>4</maxRunningApps>
    <aclSubmitApps>*</aclSubmitApps>

    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <userMaxJobsDefault>50</userMaxJobsDefault>

    <maxMaps>200</maxMaps>
    <maxReduces>200</maxReduces>
    <minResources>20000 mb, 10 vcores</minResources>
    <maxResources>500000 mb, 175 vcores</maxResources>

    <queue name="dev">
       <maxMaps>200</maxMaps>
       <maxReduces>200</maxReduces>
       <minResources>20000 mb, 10 vcores</minResources>
       <maxResources>500000 mb, 175 vcores</maxResources>
    </queue>

    <queue name="stage">
       <maxMaps>200</maxMaps>
       <maxReduces>200</maxReduces>
       <minResources>20000 mb, 10 vcores</minResources>
       <maxResources>500000 mb, 175 vcores</maxResources>
    </queue>
  </queue>

<!-- comms -->
 <queue name='comms'>
    <schedulingPolicy>fair</schedulingPolicy>   <!-- can be fifo too -->

    <queue name="interactive">
       <maxRunningApps>5</maxRunningApps>
       <aclSubmitApps>*</aclSubmitApps>
    </queue>

    <queue name="batch">
       <maxRunningApps>10</maxRunningApps>
       <aclSubmitApps>*</aclSubmitApps>
    </queue>

  </queue>

<!-- cad -->
 <queue name='cad'>
    <schedulingPolicy>fair</schedulingPolicy>   <!-- can be fifo too -->

    <queue name="interactive">
       <maxRunningApps>5</maxRunningApps>
       <aclSubmitApps>*</aclSubmitApps>
    </queue>


    <queue name="batch">
       <maxRunningApps>10</maxRunningApps>
       <aclSubmitApps>*</aclSubmitApps>
    </queue>

  </queue>



<!-- dwr -->
  <queue name="dwr">

    <schedulingPolicy>fair</schedulingPolicy>   <!-- can be fifo too -->
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <userMaxJobsDefault>50</userMaxJobsDefault>

    <maxMaps>200</maxMaps>
    <maxReduces>200</maxReduces>
    <minResources>20000 mb, 10 vcores</minResources>
    <maxResources>500000 mb, 175 vcores</maxResources>

<!-- INTERACTIVE. 5 at once -->
    <queue name="interactive">
        <weight>2.0</weight>
        <maxRunningApps>5</maxRunningApps>

       <maxMaps>200</maxMaps>
       <maxReduces>200</maxReduces>
       <minResources>20000 mb, 10 vcores</minResources>
       <maxResources>500000 mb, 175 vcores</maxResources>

<!-- not used. Number of seconds after which the pool can preempt other
pools -->
        <minSharePreemptionTimeout>60</minSharePreemptionTimeout>

<!-- per user. but given everything is dwr (for now) it's not helpful -->
        <userMaxAppsDefault>5</userMaxAppsDefault>
        <aclSubmitApps>*</aclSubmitApps>
    </queue>


<!-- BATCH. 15 at once -->
    <queue name="batch_sql">
        <weight>1.5</weight>
        <maxRunningApps>15</maxRunningApps>

       <maxMaps>200</maxMaps>
       <maxReduces>200</maxReduces>
       <minResources>20000 mb, 10 vcores</minResources>
       <maxResources>500000 mb, 175 vcores</maxResources>

<!-- not used. Number of seconds after which the pool can preempt other
pools -->
        <minSharePreemptionTimeout>300</minSharePreemptionTimeout>

        <userMaxAppsDefault>50</userMaxAppsDefault>
        <aclSubmitApps>*</aclSubmitApps>
    </queue>


<!-- sqoop, distcp, hdfs-put type jobs here. 3 at once -->
    <queue name="batch_hdfs">
        <weight>1.0</weight>
        <maxRunningApps>3</maxRunningApps>

<!-- not used. Number of seconds after which the pool can preempt other
pools -->
        <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
        <userMaxAppsDefault>50</userMaxAppsDefault>
        <aclSubmitApps>*</aclSubmitApps>
    </queue>


<!-- QA. 3 at once -->
    <queue name="qa">
        <weight>1.0</weight>
        <maxRunningApps>100</maxRunningApps>

<!-- not used. Number of seconds after which the pool can preempt other
pools -->
        <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
        <aclSubmitApps>*</aclSubmitApps>
        <userMaxAppsDefault>50</userMaxAppsDefault>

    </queue>

<!-- big, unruly jobs -->
    <queue name="truck_lane">
        <weight>0.75</weight>
        <maxRunningApps>1</maxRunningApps>
        <minMaps>5</minMaps>
        <minReduces>5</minReduces>

<!-- let's try without static values and see how the "weight" works -->
        <maxMaps>192</maxMaps>
        <maxReduces>192</maxReduces>
        <minResources>20000 mb, 10 vcores</minResources>
        <maxResources>500000 mb, 200 vcores</maxResources>

<!-- not used. Number of seconds after which the pool can preempt other
pools -->
<!--
        <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
        <aclSubmitApps>*</aclSubmitApps>
        <userMaxAppsDefault>50</userMaxAppsDefault>
-->
    </queue>
  </queue>

<!-- DEFAULT. 2 at once -->
  <queue name="default">
       <maxRunningApps>2</maxRunningApps>

       <maxMaps>40</maxMaps>
       <maxReduces>40</maxReduces>
       <minResources>20000 mb, 10 vcores</minResources>
       <maxResources>20000 mb, 10 vcores</maxResources>

<!-- not used. Number of seconds after which the pool can preempt other
pools -->
      <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
      <userMaxAppsDefault>5</userMaxAppsDefault>
      <aclSubmitApps>*</aclSubmitApps>
  </queue>


</allocations>



<!-- some other stuff

    <minResources>10000 mb,0vcores</minResources>
    <maxResources>90000 mb,0vcores</maxResources>

    <minMaps>10</minMaps>
    <minReduces>5</minReduces>

-->

<!-- enabling
   * Bringing the queues into effect:
   Once the required parameters are defined in the fair-scheduler.xml file,
   run this command to bring the changes into effect:
   yarn rmadmin -refreshQueues
-->

<!-- verifying
  Once the command runs properly, verify that the queues are set up using
  one of two options:

  1) hadoop queue -list
  or
  2) Open the YARN ResourceManager web UI at
     http://<ResourceManager-hostname>:8088 and click Scheduler.

-->


<!-- notes
   [fail_user@phd11-nn ~]$ id
   uid=507(fail_user) gid=507(failgroup) groups=507(failgroup)
   [fail_user@phd11-nn ~]$ hadoop queue -showacls
-->


<!-- submit
   To submit an application, use the parameter
   -Dmapred.job.queue.name=<queue-name> (deprecated name)
   or -Dmapreduce.job.queuename=<queue-name>
-->
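
The per-queue limits in the allocation file above can be listed mechanically and compared against the "QUEUES" summary comment. The following is a minimal sketch, not part of the original setup: the function name `list_queue_limits` and the embedded `SAMPLE` document are hypothetical, and the sample is a cut-down stand-in for the real file. It only reads `maxRunningApps` and makes no claim about which elements the scheduler itself honors.

```python
import xml.etree.ElementTree as ET


def list_queue_limits(xml_text):
    """Return {full_queue_name: maxRunningApps or None} for every <queue>.

    Nested queues are reported with dotted names (parent.child).
    """
    root = ET.fromstring(xml_text)
    limits = {}

    def walk(node, prefix):
        # Only direct <queue> children, so nesting is preserved in the name.
        for queue in node.findall("queue"):
            name = prefix + queue.get("name")
            text = queue.findtext("maxRunningApps")
            limits[name] = int(text) if text is not None else None
            walk(queue, name + ".")

    walk(root, "")
    return limits


# Hypothetical, shortened stand-in for the allocation file in this message.
SAMPLE = """<?xml version="1.0"?>
<allocations>
  <queue name="dwr">
    <queue name="interactive"><maxRunningApps>5</maxRunningApps></queue>
    <queue name="qa"><maxRunningApps>100</maxRunningApps></queue>
  </queue>
  <queue name="default"><maxRunningApps>2</maxRunningApps></queue>
</allocations>"""

if __name__ == "__main__":
    for name, limit in sorted(list_queue_limits(SAMPLE).items()):
        print(f"{name}: {limit}")
```

To run it against a real file, pass the file contents instead of `SAMPLE`, for example `list_queue_limits(open("fair-scheduler.xml").read())`.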





*** yarn-site.xml



$ cat yarn-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<!--Autogenerated yarn params from puppet yaml hash
yarn_site_parameters__xml -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>FOO.sv2.trulia.com</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/storage0/hadoop/yarn/local,/storage1/hadoop/yarn/local,/storage2/hadoop/yarn/local,/storage3/hadoop/yarn/local,/storage4/hadoop/yarn/local,/storage5/hadoop/yarn/local</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>com.pepperdata.supervisor.scheduler.PepperdataSupervisorYarnFair</value>
  </property>
  <property>
    <name>yarn.application.classpath</name>
    <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,$TEZ_HOME/*,$TEZ_HOME/lib/*</value>
  </property>
  <property>
    <name>pepperdata.license.key.specification</name>
    <value>data://removed</value>
  </property>
  <property>
    <name>pepperdata.license.key.comments</name>
    <value>License Type: PRODUCTION Expiration Date (UTC): 2017/02/01
Company Name: Trulia, LLC Cluster Name: trulia-production Number of Nodes:
150 Contact Person Name: Deep Varma Contact Person Email: dvarma@trulia.com
</value>
  </property>
  <property>
    <name>yarn.timeline-service.hostname</name>
    <value>FOO.sv2.trulia.com</value>
  </property>
  <property>
    <name>yarn.timeline-service.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.timeline-service.webapp.address</name>
    <value>FOO.sv2.trulia.com:8188</value>
  </property>
  <property>
    <name>yarn.timeline-service.http-cross-origin.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.timeline-service.ttl-enable</name>
    <value>false</value>
  </property>

<!--
  <property>
    <name>yarn.timeline-service.store-class</name>
    <value>org.apache.hadoop.yarn.server.timeline.RollingLevelDbTimelineStore</value>
  </property>
-->
  <property>
    <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.scheduler.fair.user-as-default-queue</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.scheduler.fair.preemption</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.scheduler.fair.sizebasedweight</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
    <value>98.5</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>${yarn.log.dir}/userlogs</value>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/app-logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.delete.debug-delay-sec</name>
    <value>600</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>
    <value>http://FOO.sv2.trulia.com:19888/jobhistory/logs</value>
  </property>

</configuration>
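
To pull just the scheduler-related settings out of a yarn-site.xml like the one above, a small script along these lines might help. This is a sketch under stated assumptions: the helper names `read_properties` and `scheduler_settings` and the embedded `SAMPLE_SITE` document are hypothetical, the sample is a shortened copy of the file in this message, and the script does not expand `${...}` variables the way Hadoop's own configuration loader does.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for each <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            # Values are kept as raw strings; no variable expansion is done.
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def scheduler_settings(props):
    """Keep the scheduler class and every yarn.scheduler.* property."""
    return {
        key: value
        for key, value in props.items()
        if key == "yarn.resourcemanager.scheduler.class"
        or key.startswith("yarn.scheduler.")
    }


# Hypothetical, shortened stand-in for the yarn-site.xml in this message.
SAMPLE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>com.pepperdata.supervisor.scheduler.PepperdataSupervisorYarnFair</value>
  </property>
  <property>
    <name>yarn.scheduler.fair.preemption</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for key, value in sorted(scheduler_settings(read_properties(SAMPLE_SITE)).items()):
        print(f"{key} = {value}")
```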


On Wed, Jan 11, 2017 at 2:27 PM, Akash Mishra <akash.mishra20@gmail.com>
wrote:

> Please post your fair-scheduler.xml file and yarn-site.xml
>
> On Wed, Jan 11, 2017 at 9:14 PM, Stephen Sprague <spragues@gmail.com>
> wrote:
>
>> hey guys,
>> i'm running the RM with the above options (version 2.6.1) and get an NPE
>> upon startup.
>>
>> {code}
>> 17/01/11 12:44:45 FATAL resourcemanager.ResourceManager: Error starting ResourceManager
>> java.lang.NullPointerException
>>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.a.getName(SourceFile:204)
>>         at org.apache.hadoop.service.CompositeService.addService(CompositeService.java:73)
>>         at org.apache.hadoop.service.CompositeService.addIfService(CompositeService.java:88)
>>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:490)
>>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:993)
>>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:255)
>>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1214)
>> 17/01/11 12:44:45 INFO resourcemanager.ResourceManager: SHUTDOWN_MSG:
>> {code}
>>
>> the fair-scheduler.xml file is fine and works in INFO level logging so
>> i'm pretty sure there's nothing "wrong" with it. So with DEBUG level it's
>> making this java call and barfing.
>>
>> Any ideas how to fix this?
>>
>> thanks,
>> Stephen.
>>
>
>
>
> --
>
> Regards,
> Akash Mishra.
>
>
> "It's not our abilities that make us, but our decisions."--Albus Dumbledore
>
