hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3376) [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
Date Thu, 22 May 2008 08:57:56 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598949#action_12598949 ]

Hemanth Yamijala commented on HADOOP-3376:
------------------------------------------

Some comments:

- I think job-feasibility-attr should be optional. Code that depends on this attribute should
check for it, or be changed to handle the case where it is not defined:
In torque.py, isJobFeasible would raise an exception if job-feasibility-attr is not defined,
and the info message printed would not be very descriptive - I think it would just print
'job-feasibility-attr' with no information about what the error actually is.
__check_job_state does not handle the case where job-feasibility-attr is not defined.
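A minimal sketch (hypothetical names, not the actual HOD code) of how the attribute could be treated as optional, assuming a dict-like config and a dict of job attributes:

```python
# Sketch: guard against an undefined job-feasibility-attr instead of
# raising an exception with an unhelpful message.

def is_job_feasible(config, job_attrs):
    # 'job-feasibility-attr' names the job attribute that carries the
    # feasibility state; both dicts here are hypothetical stand-ins.
    attr = config.get('job-feasibility-attr')
    if attr is None:
        # Attribute not configured: treat feasibility checking as disabled.
        return True
    state = job_attrs.get(attr)
    if state is None:
        # Print a descriptive message instead of just the bare key name.
        print("Feasibility attribute '%s' is not set on the job; "
              "assuming the job is feasible." % attr)
        return True
    return state == 'FEASIBLE'
```

The same guard would apply in __check_job_state: look the attribute up first, and skip the limit check entirely when it is absent.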

- The messages now read as follows:
(In case of req. resources > max resources):
Request exceeded maximum user limits. CurentUsage:%s, Requested:%s, MaxLimit:%s
(In other case):
Request exceeded maximum user limits. CurentUsage:3, Requested:3, MaxLimit:3 This cluster
will remain queued till old clusters free resources.
The message still does not clarify which resources are being exceeded.

I suggest the following:
Requested number of nodes exceeded maximum user limits. Current Usage:%s, Requested:%s, Maximum
Limit:%s. This cluster cannot be allocated now.

and

Requested number of nodes exceeded maximum user limits. Current Usage:%s, Requested:%s, Maximum
Limit:%s. This cluster allocation will succeed only after other clusters are deallocated.
(Note: I also corrected some typos in the message)
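The two suggested messages share a common prefix and differ only in their tail. A minimal sketch (hypothetical function and integer inputs, not HOD code) of choosing between them based on which limit condition was hit:

```python
# Sketch: pick the right user-limit message. 'requested > max_limit' can
# never succeed; 'current + requested > max_limit' can succeed later.

def limit_message(current, requested, max_limit):
    base = ("Requested number of nodes exceeded maximum user limits. "
            "Current Usage:%s, Requested:%s, Maximum Limit:%s. "
            % (current, requested, max_limit))
    if requested > max_limit:
        # The request alone exceeds the limit: fail fast.
        return base + "This cluster cannot be allocated now."
    # Only the aggregate exceeds the limit: the request can be queued.
    return (base + "This cluster allocation will succeed only after "
            "other clusters are deallocated.")
```

This keeps the two cases clearly distinguished for the submitter, which is the point of the suggested rewording.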

- The executable bit is not being turned on for support/checklimits.sh. This is due to a bug
in the ant script: for code under the contrib projects, only files under the bin/ folder are
made executable when packaged. As this is not a bug in HOD itself, I think we should leave
this as it is, but update the usage documentation to tell users to make the script executable.
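The usage documentation would then need to include a step along these lines (demonstrated here on a stand-in file, since the packaged path depends on where the contrib tarball is unpacked):

```shell
# The ant packaging bug leaves the script without the executable bit,
# so set it manually after unpacking. 'checklimits.sh' here is a
# stand-in created for demonstration.
touch checklimits.sh
chmod +x checklimits.sh
test -x checklimits.sh
```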

- In checklimits.sh - the sleep at the end is not required.

- In the case where current usage plus requested usage exceeds the limits, the critical message
is printed every 10 seconds. It should be printed only once.

Other than these, I tested checklimits and hod in both scenarios and they work fine.

> [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1
>
>
> Currently If we set up resource manager/scheduler limits on the jobs submitted, any HOD
> cluster that exceeds/violates these limits may 1) get blocked/queued indefinitely or 2) blocked
> till resources occupied by old clusters get freed. HOD should detect these scenarios and deal
> intelligently, instead of just waiting for a long time/ for ever. This means more and proper
> information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager queue preventing
> other users from using the queue. To avoid this, we could have various types of limits setup
> in either resource manager or a scheduler - max node limit in torque(per job limit), maxproc
> limit in maui (per user/class), maxjob limit in maui(per user/class) etc. But there is one
> problem with the current setup - for e.g if we set up maxproc limit in maui to limit the aggregate
> number of nodes by any user over all jobs, 1) jobs get queued indefinitely if jobs exceed
> max limit and 2) blocked if it asks for nodes < max limit, but some of the resources are
> already used by jobs from the same user. This issue addresses how to deal with scenarios like
> these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

