helix-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HELIX-655) Helix per-participant concurrent task throttling
Date Mon, 15 May 2017 18:49:04 GMT

    [ https://issues.apache.org/jira/browse/HELIX-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16011105#comment-16011105
] 

ASF GitHub Bot commented on HELIX-655:
--------------------------------------

Github user jiajunwang commented on a diff in the pull request:

    https://github.com/apache/helix/pull/89#discussion_r116571005
  
    --- Diff: helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java
---
    @@ -90,60 +96,86 @@ private BestPossibleStateOutput compute(ClusterEvent event, Map<String,
Resource
     
         BestPossibleStateOutput output = new BestPossibleStateOutput();
     
    -    for (String resourceName : resourceMap.keySet()) {
    -      logger.debug("Processing resource:" + resourceName);
    +    // Reset current INIT/RUNNING tasks on participants for throttling
    +    JobRebalancer.resetActiveTaskCount(cache.getLiveInstances().keySet(), currentStateOutput);
     
    -      Resource resource = resourceMap.get(resourceName);
    -      // Ideal state may be gone. In that case we need to get the state model name
    -      // from the current state
    -      IdealState idealState = cache.getIdealState(resourceName);
    +    PriorityQueue<JobResourcePriority> jobResourceQueue = new PriorityQueue<JobResourcePriority>();
    --- End diff --
    
    Got your point.
    But we don't need to use all features of the queue to leverage it here.
    If the performance is bad, we shall consider an alternative. But that's not the case.
    And even we use "sortedSet", there is no usage of a "set" here. So, to me, there is no
different.


> Helix per-participant concurrent task throttling
> ------------------------------------------------
>
>                 Key: HELIX-655
>                 URL: https://issues.apache.org/jira/browse/HELIX-655
>             Project: Apache Helix
>          Issue Type: New Feature
>          Components: helix-core
>    Affects Versions: 0.6.x
>            Reporter: Jiajun Wang
>            Assignee: Junkai Xue
>
> h1. Overview
> Currently, all runnable jobs/tasks in Helix are equally treated. They are all scheduled
according to the rebalancer algorithm. Means, their assignment might be different, but they
will all be in RUNNING state.
> This may cause an issue if there are too many concurrently runnable jobs. When Helix
controller starts all these jobs, the instances may be overload as they are assigning resources
and executing all the tasks. As a result, the jobs won't be able to finish in a reasonable
time window.
> The issue is even more critical to long run jobs. According to our meeting with Gobblin
team, when a job is scheduled, they allocate resource for the job. So in the situation described
above, more and more resources will be reserved for the pending jobs. The cluster will soon
be exhausted.
> For solving the problem, an application needs to schedule jobs in a relatively low frequency
(what Gobblin is doing now). This may cause low utilization.
> A better way to fix this issue, at framework level, is throttling jobs/tasks that are
running concurrently, and allowing setting priority for different jobs to control total execute
time.
> So given same amount of jobs, the cluster is in a better condition. As a result, jobs
running in that cluster have a more controllable execute time.
> Existing related control mechanisms are:
> * ConcurrentTasksPerInstance for each job
> * ParallelJobs for each workflow
> * Threadpool limitation on the participant if user customizes TaskStateModelFactory.
> But none of them can directly help when concurrent workflows or jobs number is large.
If an application keeps scheduling jobs/jobQueues, Helix will start any runnable jobs without
considering the workload on the participants.
> The application may be able to carefully configures these items to achieve the goal.
But they won't be able to easily find the sweet spot. Especially the cluster might be changing
(scale out etc.).
> h2. Problem summary
> # All runnable tasks will start executing, which may overload the participant.
> # Application needs a mechanism to prioritize important jobs (or workflows). Otherwise,
important tasks may be blocked by other less important ones. And allocated resource is wasted.
> h2. Feature proposed
> Based on our discussing, we proposed 2 features that can help to resolve the issue.
> # Running task throttling on each participant. This is for avoiding overload.
> # Job priority control that ensures high priority jobs are scheduled earlier.
> In addition, application can leverage workflow/job monitor items as feedback from Helix
to adjust their stretgy.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message