incubator-oozie-users mailing list archives

From Mohammad Islam <misla...@yahoo.com>
Subject Re: Coordinator Fails to start a Coordinator action
Date Fri, 11 Nov 2011 08:13:50 GMT
These two are different.
Coordinator concurrency controls how many coordinator actions can run simultaneously.
CallableQueue concurrency, on the other hand, controls Oozie's internal queue processing.
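
For illustration, the coordinator-level setting lives in the <controls> block of coordinator.xml. The fragment below is only a minimal sketch and is not taken from this thread: the app name, dates, frequency, and the ${workflowAppPath} property are hypothetical placeholders, and the schema version is an assumption that depends on your Oozie release.

```xml
<!-- Hypothetical coordinator.xml; name, dates and paths are placeholders -->
<coordinator-app name="example-coord" frequency="${coord:days(1)}"
                 start="2011-11-01T00:00Z" end="2011-12-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
    <controls>
        <!-- At most one action of this coordinator runs at a time -->
        <concurrency>1</concurrency>
    </controls>
    <action>
        <workflow>
            <!-- Assumed property pointing at the workflow application directory -->
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```

The oozie.service.CallableQueueService.callable.concurrency property, by contrast, is set server-wide in oozie-site.xml.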

The most common cause of a coordinator action failure is that it was unable to submit the
workflow job, possibly due to an XML parsing error in workflow.xml.

Were you able to submit the exact same workflow.xml without the coordinator (just for testing)?
Alternatively, check the Oozie log for any such exception.

If nothing gives any clue, you can send us the relevant log using pastebin.com.

Regards,
Mohammad

________________________________
From: kisalay <kisalay@gmail.com>
To: oozie-users@incubator.apache.org
Sent: Thursday, November 10, 2011 9:22 PM
Subject: Re: Coordinator Fails to start a Coordinator action

Aaruna, Mohammad,

I too faced a similar issue, and upon digging a bit further I zeroed in on
the following property in oozie-site.xml:

    <property>
        <name>oozie.service.CallableQueueService.callable.concurrency</name>
        <value>100</value>
        <description>
            Maximum concurrency for a given callable type.
            Each command is a callable type (submit, start, run, signal, job, jobs, suspend, resume, etc).
            Each action type is a callable type (Map-Reduce, Pig, SSH, FS, sub-workflow, etc).
            All commands that use action executors (action-start, action-end, action-kill and action-check) use
            the action type as the callable type.
        </description>
    </property>

I think the value you set here determines the maximum concurrency that
you can set in the coordinator.xml for the workflow.

Mohammad, Alejandro,

I wanted to know whether the concurrency set in coordinator.xml is
superseded by the concurrency set in oozie-site.xml, or whether the two
properties are enforced separately?

On Fri, Nov 11, 2011 at 5:29 AM, Max Hansmire <hansmire@gmail.com> wrote:

> Yes, the workflow never starts. The Coordinator action is marked as FAILED
> with no Ext. Id.
>
> oozie.service.CallableQueueService.queue.size=10000. Is this the queue
> size that you are referring to?
>
> There are about 6 coordinators scheduled to start at the same time as this
> one.
>
> Max
>
> On Nov 10, 2011, at 6:30 PM, Mayank Bansal wrote:
>
> > Hi,
> >
> > This error only indicates that the coordinator was not able to acquire a
> > lock on a particular job, which is an intermittent error and should be
> > resolved by Oozie's recovery service.
> >
> > Is the job failing, or is it just taking more time?
> >
> > What is the queue size? Are there a lot of jobs running?
> >
> > Thanks,
> > Mayank
> >
> >
> > On Thu, Nov 10, 2011 at 3:25 PM, Max Hansmire <hansmire@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I am running into an issue with Oozie coordinators sometimes not
> >> starting. It looks like some kind of deadlocking issue. It only happens
> >> intermittently. Is this a known issue?
> >>
> >> The coordinator has no input datasets. I have included the most relevant
> >> line from the log. I am running Cloudera 3 Beta 4. Let me know if you
> >> think you need more info.
> >>
> >> Max
> >>
> >> 2011-11-07 01:02:29,494 ERROR CoordActionMaterializeCommand:522 -
> >> USER[oozie] GROUP[users] TOKEN[] APP[load_dim_traffic]
> >> JOB[0000081-111101173146276-oozie-oozi-C] ACTION[-] XException,
> >> org.apache.oozie.command.CommandException: E1001: Could not read the
> >> coordinator job definition, E0607: Other error in operation [updateJob],
> >> Unable to obtain an object lock on "A lock could not be obtained due to a
> >> deadlock, cycle of locks and waiters is:
> >> Lock : ROW, COORD_JOBS, (1,27)
> >>
> >>
>
>