kylin-dev mailing list archives

From "chuxiao (Jira)" <>
Subject [jira] [Created] (KYLIN-4250) FetcherRunner should skip the job and process other jobs instead of throwing an exception when part of the job's metadata is not found
Date Tue, 12 Nov 2019 08:47:00 GMT
chuxiao created KYLIN-4250:

             Summary: FetcherRunner should skip the job and process other jobs instead of
throwing an exception when part of the job's metadata is not found
                 Key: KYLIN-4250
             Project: Kylin
          Issue Type: Bug
          Components: Job Engine
    Affects Versions: v3.0.0-alpha
            Reporter: chuxiao

Our cluster has two nodes (named build1 and build2) that build cube jobs, using DistributedScheduler.

There is a job, id 9f05b84b-cec9-81ee-9336-5a419e451a55, shown as building on the build1 node.
The job displays Error, but its first sub-task, creating the Hive flat table, displays Ready,
and the first task's YARN job can be seen running in the YARN UI. After the YARN job succeeds,
the job re-runs the first sub-task, again and again.

Looking at the build1 log: the status of this job changes from Ready to Running, then the
first task's status changes from Ready to Running, then the job update is broadcast and the
broadcast is received. But after twenty seconds, another broadcast of updated job information
arrives.

After a few minutes, the first task completes, but the log shows the job status changing
from Error to Ready! Then the job status changes from Ready to Running, the first task starts
running again, and the log above repeats.

I suspected that another node was changing the job status. Looking at the build2 node log, there
are many exception logs saying there is no output for a different job, id f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d:

2019-09-20 14:20:58,825 WARN  [pool-10-thread-1] threadpool.DefaultFetcherRunner:90 : Job
Fetcher caught a exception
java.lang.IllegalArgumentException: there is no related output for job id:f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d
        at org.apache.kylin.job.execution.ExecutableManager.getOutputDigest(
        at java.util.concurrent.Executors$
        at java.util.concurrent.FutureTask.runAndReset(
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(
        at java.util.concurrent.ScheduledThreadPoolExecutor$
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
        at java.util.concurrent.ThreadPoolExecutor$

In addition, every time build2 receives build1's broadcast of the job update, twenty seconds
later its log shows the first task's state being changed from Running to Ready and a broadcast being sent.

After restarting the build2 node, the "Job Fetcher caught a exception" log no longer appears,
and job 9f05b84b-cec9-81ee-9336-5a419e451a55 executed successfully.

This is due to a job metadata synchronization problem, which triggers a job scheduling bug:
the build1 node tries to run the job, but the other build node kills it and changes the job
status to Error, causing the problem.

The build2 node seems to have a metadata synchronization problem: the job with id f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d
exists in ExecutableDao's executableDigestMap but does not exist in ExecutableDao's executableOutputDigestMap.
Each time FetcherRunner iterates over the jobs, this job throws an exception and fetchFailed is set to true.
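The inconsistency can be pictured with a minimal sketch (simplified maps standing in for ExecutableDao's digest caches; class and field shapes here are illustrative assumptions, not Kylin's actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for ExecutableDao's two digest caches
// (names and shapes are simplified assumptions, not Kylin's classes).
class DigestCaches {
    final Map<String, String> executableDigestMap = new HashMap<>();
    final Map<String, String> executableOutputDigestMap = new HashMap<>();

    // Mirrors the failing lookup: the job record exists, but its output
    // record is missing, so every fetch cycle throws.
    String getOutputDigest(String jobId) {
        String output = executableOutputDigestMap.get(jobId);
        if (output == null) {
            throw new IllegalArgumentException("there is no related output for job id:" + jobId);
        }
        return output;
    }
}
```

Once the two maps disagree like this, every fetch cycle that touches the broken job throws, no matter how healthy the other jobs are.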

    try {
        // throws when the job's output metadata is missing
        final Output outputDigest = getExecutableManger().getOutputDigest(id);
        ...
    } catch (Throwable th) {
        fetchFailed = true; // this could happen when resource store is unavailable
        logger.warn("Job Fetcher caught a exception ", th);
    }

When build2 then processes the job that build1 is running: since fetchFailed is true, the
job is not in build2's list of running jobs, and the job status is Running, FetcherRunner.jobStateCount()
will kill the job, set the running task's status to Ready, and set the job status to Error:

    protected void jobStateCount(String id) {
        final Output outputDigest = getExecutableManger().getOutputDigest(id);
        // logger.debug("Job id:" + id + " not runnable");
        if (outputDigest.getState() == ExecutableState.SUCCEED) {
            ...
        } else if (outputDigest.getState() == ExecutableState.ERROR) {
            ...
        } else if (outputDigest.getState() == ExecutableState.DISCARDED) {
            ...
        } else if (outputDigest.getState() == ExecutableState.STOPPED) {
            ...
        } else {
            if (fetchFailed) {
                // this branch kills the job
            } else {
                ...
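The kill decision above can be condensed into a toy model (field and method names are illustrative, not Kylin's API):

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the decision build2 makes for a job it does not own
// (fields and method names are illustrative, not Kylin's API).
class KillDecision {
    enum State { READY, RUNNING, SUCCEED, ERROR, DISCARDED, STOPPED }

    boolean fetchFailed;                             // set when a previous fetch cycle threw
    final Set<String> runningJobs = new HashSet<>(); // jobs this node scheduled itself

    // True when the fetcher would kill the job: the last fetch failed,
    // the job is not tracked locally, yet its persisted state is RUNNING.
    boolean wouldKill(String jobId, State state) {
        return fetchFailed && !runningJobs.contains(jobId) && state == State.RUNNING;
    }
}
```

Note that fetchFailed was set by a completely unrelated job's broken metadata, yet it flips the decision for the healthy job that build1 owns.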

After the first task of the job runs successfully on build1, the task state stays Ready
and the job status is Error; executeResult returns success, so the job status is changed back
to Ready. A job in the Ready state does not release the ZooKeeper lock, so build1 keeps
scheduling the job to run, and build2 keeps killing it, again and again. The build job never
manages to run normally.

There are two problems with FetcherRunner:
1. When FetcherRunner iterates over the jobs, if part of a job's metadata is not found, an
exception is thrown. We can skip that job and continue iterating over the other jobs.

2. For DistributedScheduler, even if fetchFailed is true, the job is not in runningJobs, and
the status is Running, FetcherRunner should not kill the job, because the job may be scheduled
by another Kylin instance.
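A sketch of the fix for problem 1: catch the per-job failure inside the loop and continue, instead of letting one broken job abort the whole fetch cycle (the loop body is simplified to just the metadata lookup; only the control flow is the point):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class SkipBrokenJobs {
    // Iterates over all job ids; a job whose metadata lookup throws is
    // skipped, and the remaining jobs are still fetched, so one job with
    // broken metadata no longer poisons the whole fetch cycle.
    static List<String> fetchAll(List<String> jobIds, Function<String, String> getOutputDigest) {
        List<String> fetched = new ArrayList<>();
        for (String id : jobIds) {
            try {
                getOutputDigest.apply(id); // would throw for broken metadata
                fetched.add(id);
            } catch (IllegalArgumentException e) {
                // Log and move on instead of failing the entire round.
                System.out.println("skip job " + id + ": " + e.getMessage());
            }
        }
        return fetched;
    }
}
```

With this shape, fetchFailed would no longer be set for the whole round by a single job's missing output, so the healthy jobs keep being scheduled.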

This JIRA solves problem 1; another JIRA will solve problem 2.
