oozie-user mailing list archives

From Max Hansmire <hansm...@gmail.com>
Subject Re: Coordinator Fails to start a Coordinator action
Date Sat, 12 Nov 2011 03:44:07 GMT
It is a derby DB.

Max
On Nov 11, 2011, at 9:56 PM, Angelo K. Huang wrote:

> Hi Max,
> 
> What database are you currently using?
> 
> --angelo
> 
> On Fri, Nov 11, 2011 at 6:59 AM, Max Hansmire <hansmire@gmail.com> wrote:
> 
>> Thanks for the help. I have not modified the workflow.xml between
>> successes and failures.
>> 
>> In the web UI the coordinator action is marked as Killed and has no
>> Ext. Id. Looking at the logs more closely, it appears that a workflow
>> actually was launched.
>> 
>> That workflow is marked as PREP though. The workflow has no log associated
>> with it.
>> 
>> Here is the snippet of log that shows the Coordinator being failed. This
>> is a different example of the same issue that I saw before.
>> 
>> 2011-11-11 01:14:27,402  WARN CoordActionStartCommand:528 - USER[oozie]
>> GROUP[users] TOKEN[] APP[load_dim_siteinformation]
>> JOB[0000077-111101173146276-oozie-oozi-C]
>> ACTION[0000077-111101173146276-oozie-oozi-C@9] Failing the action
>> 0000077-111101173146276-oozie-oozi-C@9. Because E1005 : E0607: Other
>> error in operation [updateWorkflow], Unable to obtain an object lock on "A
>> lock could not be obtained due to a deadlock, cycle of locks and waiters is:
>> Lock : ROW, WF_JOBS, (2,11)
>> Waiting XID : {4392900, U} , SA, UPDATE WF_JOBS SET app_name = ?,
>> app_path = ?, conf = ?, group_name = ?, run = ?, user_name = ?, auth_token
>> = ?, created_time = ?, end_time = NULL, external_id = NULL,
>> last_modified_time = ?, log_token = ?, proto_action_conf = ?, sla_xml = ?,
>> start_time = ?, status = ?, wf_instance = ? WHERE id IN (SELECT DISTINCT
>> t0.id FROM WF_JOBS t0 WHERE (t0.id = ?) AND t0.bean_type = ?)
>> Granted XID : {4392891, U}
>> Lock : ROW, WF_JOBS, (53,12)
>> Waiting XID : {4392891, S} , SA, UPDATE WF_JOBS SET app_name = ?,
>> app_path = ?, conf = ?, group_name = ?, run = ?, user_name = ?, auth_token
>> = ?, created_time = ?, end_time = NULL, external_id = NULL,
>> last_modified_time = ?, log_token = ?, proto_action_conf = ?, sla_xml = ?,
>> start_time = ?, status = ?, wf_instance = ? WHERE id IN (SELECT DISTINCT
>> t0.id FROM WF_JOBS t0 WHERE (t0.id = ?) AND t0.bean_type = ?)
>> Granted XID : {4392900, X}
>> . The selected victim is XID : 4392900. {prepstmnt 622745466 UPDATE
>> WF_JOBS SET app_name = ?, app_path = ?, conf = ?, group_name = ?, run = ?,
>> user_name = ?, auth_token = ?, created_time = ?, end_time = NULL,
>> external_id = NULL, last_modified_time = ?, log_token = ?,
>> proto_action_conf = ?, sla_xml = ?, start_time = ?, status = ?, wf_instance
>> = ? WHERE id IN (SELECT DISTINCT t0.id FROM WF_JOBS t0 WHERE (t0.id = ?)
>> AND t0.bean_type = ?) [params=(String) load-sql-wf, (String)
>> hdfs://adhocmaster01n:56310/user/oozie/etl/workflows/load_sql/workfl...,
>> (String) <configuration>
>> <property>
>>   <name>oozie.coord.application.pat..., (String) users, (int) 0, (String)
>> oozie, (String) ?, (Timestamp) 2011-11-11 01:14:07.359, (Timestamp)
>> 2011-11-11 01:14:07.381, (String) , (String) <?xml version="1.0"
>> encoding="UTF-8" standalone="no"?><configuration..., (String) , (Timestamp)
>> 2011-11-11 01:14:07.376, (String) RUNNING, (byte[]) [B@7a19ae93, (String)
>> 0000027-111108170726625-oozie-oozi-W, (String) WorkflowJobBean]}
>> [code=30000, state=40001] [java.lang.String]".
>> 2011-11-11 01:14:27,421  INFO CoordActionStartCommand:525 - USER[oozie]
>> GROUP[users] TOKEN[] APP[load_dim_siteinformation]
>> JOB[0000077-111101173146276-oozie-oozi-C]
>> ACTION[0000077-111101173146276-oozie-oozi-C@9] ENDED
>> CoordActionStartCommand  actionId=0000077-111101173146276-oozie-oozi-C@9
>> 
>> I also created a pastebin with the full coordinator log:
>> http://pastebin.com/YAME879Q  It is action 9 that failed. Action 3 also
>> failed.
>> 
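For anyone reading a Derby deadlock report like the one above, here is a minimal Python sketch that pulls the lock cycle out of the text. The report string is a shortened copy of the log excerpt (the SQL statement text is trimmed), and the function name and regular expressions are my own illustration, not part of Oozie or Derby.

```python
import re

# Shortened copy of the deadlock report from the log above.
# The UPDATE statement text is trimmed; only the lock lines matter here.
REPORT = """\
Lock : ROW, WF_JOBS, (2,11)
Waiting XID : {4392900, U} , SA, UPDATE WF_JOBS SET ...
Granted XID : {4392891, U}
Lock : ROW, WF_JOBS, (53,12)
Waiting XID : {4392891, S} , SA, UPDATE WF_JOBS SET ...
Granted XID : {4392900, X}
. The selected victim is XID : 4392900.
"""

LOCK_RE = re.compile(r"Lock : (\w+), (\w+), \((\d+),(\d+)\)")
WAIT_RE = re.compile(r"Waiting XID : \{(\d+), (\w)\}")
GRANT_RE = re.compile(r"Granted XID : \{(\d+), (\w)\}")
VICTIM_RE = re.compile(r"selected victim is XID : (\d+)")


def parse_deadlock(report):
    """Return (edges, victim); each edge says which XID waits on which holder."""
    edges = []
    current = None
    for line in report.splitlines():
        m = LOCK_RE.search(line)
        if m:
            # A new "Lock :" line starts a new entry in the cycle.
            current = {"table": m.group(2),
                       "row": (int(m.group(3)), int(m.group(4)))}
            edges.append(current)
            continue
        m = WAIT_RE.search(line)
        if m and current is not None:
            current["waiter"] = int(m.group(1))
            current["wait_mode"] = m.group(2)
            continue
        m = GRANT_RE.search(line)
        if m and current is not None:
            current["holder"] = int(m.group(1))
            current["hold_mode"] = m.group(2)
    victim_match = VICTIM_RE.search(report)
    victim = int(victim_match.group(1)) if victim_match else None
    return edges, victim


edges, victim = parse_deadlock(REPORT)
for e in edges:
    print(f"XID {e['waiter']} waits for {e['table']} row {e['row']} "
          f"held by XID {e['holder']}")
print("victim:", victim)
```

Read this way, the report shows two transactions each waiting on a WF_JOBS row the other holds, with one of them chosen as the victim.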
>> Max
>> 
>> On Nov 11, 2011, at 3:13 AM, Mohammad Islam wrote:
>> 
>>> These two are different.
>>> Coordinator:concurrency controls how many coordinator actions can run
>>> simultaneously.
>>> CallableQueue:concurrency, on the other hand, controls Oozie's internal
>>> queue processing.
>>> 
>>> The most common cause of coordinator action failure is that it was unable
>>> to submit the workflow job, possibly due to an XML parsing error in
>>> workflow.xml.
>>> 
>>> Were you able to submit the exact same workflow.xml without the coordinator
>>> (just for testing)?
>>> Alternatively, check the Oozie log for any such exception.
>>> 
>>> If nothing gives any clue, you can send us the relevant log using
>>> pastebin.com
>>> 
>>> Regards,
>>> Mohammad
>>> 
>>> ________________________________
>>> From: kisalay <kisalay@gmail.com>
>>> To: oozie-users@incubator.apache.org
>>> Sent: Thursday, November 10, 2011 9:22 PM
>>> Subject: Re: Coordinator Fails to start a Coordinator action
>>> 
>>> Aaruna, Mohammad,
>>> 
>>> I too faced a similar issue, and upon digging a bit further I zeroed in on
>>> the following property in oozie-site.xml
>>> 
>>>    <property>
>>>        <name>oozie.service.CallableQueueService.callable.concurrency</name>
>>>        <value>100</value>
>>>        <description>
>>>            Maximum concurrency for a given callable type.
>>>            Each command is a callable type (submit, start, run, signal,
>>>            job, jobs, suspend, resume, etc).
>>>            Each action type is a callable type (Map-Reduce, Pig, SSH, FS,
>>>            sub-workflow, etc).
>>>            All commands that use action executors (action-start,
>>>            action-end, action-kill and action-check) use
>>>            the action type as the callable type.
>>>        </description>
>>>    </property>
>>> 
>>> I think the value you set here determines the maximum concurrency that
>>> you can set in the coordinator.xml for the workflow.
>>> 
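To check what a given installation actually has for this property, a small Python sketch along these lines may help. The XML here is an inline stand-in for oozie-site.xml using the values quoted in this thread (100 and 10000), and `read_properties` is a hypothetical helper name, not an Oozie tool.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for oozie-site.xml, using the values quoted in this thread.
# For a real file, ET.parse(path).getroot() would replace ET.fromstring below.
SAMPLE_SITE_XML = """\
<configuration>
    <property>
        <name>oozie.service.CallableQueueService.callable.concurrency</name>
        <value>100</value>
    </property>
    <property>
        <name>oozie.service.CallableQueueService.queue.size</name>
        <value>10000</value>
    </property>
</configuration>
"""


def read_properties(xml_text):
    """Collect <name>/<value> pairs from a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


props = read_properties(SAMPLE_SITE_XML)
for key in ("oozie.service.CallableQueueService.callable.concurrency",
            "oozie.service.CallableQueueService.queue.size"):
    print(key, "=", props.get(key))
```

This only reads the configured values; it does not say anything about how they interact with the concurrency set in coordinator.xml.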
>>> Mohammad, Alejandro,
>>> 
>>> I wanted to know whether the concurrency mentioned in the coordinator.xml
>>> is superseded by the concurrency mentioned in the oozie-site.xml, or
>>> whether the two properties are enforced separately?
>>> 
>>> On Fri, Nov 11, 2011 at 5:29 AM, Max Hansmire <hansmire@gmail.com> wrote:
>>> 
>>>> Yes, the workflow never starts. The Coordinator action is marked as
>>>> FAILED with no Ext. Id.
>>>> 
>>>> oozie.service.CallableQueueService.queue.size=10000. Is this the queue
>>>> size that you are referring to?
>>>> 
>>>> There are about 6 coordinators scheduled to start at the same time as
>>>> this one.
>>>> 
>>>> Max
>>>> 
>>>> On Nov 10, 2011, at 6:30 PM, Mayank Bansal wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> This error only says that the coordinator was not able to acquire a lock
>>>>> on a particular job. It is an intermittent error and should be resolved
>>>>> by the recovery service of Oozie.
>>>>> 
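As a generic illustration of how an intermittent deadlock error is usually tolerated, here is a minimal retry-with-backoff sketch in Python. It is not how Oozie's recovery service is implemented; `DeadlockError` and `run_with_retry` are hypothetical names, and the exception simply stands in for a database error with SQLState 40001 (the state shown in the Derby log quoted earlier in this thread).

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a database error carrying SQLState 40001."""
    sqlstate = "40001"


def run_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on DeadlockError, back off and try again.

    The last failure is re-raised once `attempts` tries are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except DeadlockError:
            if attempt == attempts:
                raise
            # Exponential backoff with jitter so that competing
            # transactions do not retry in lockstep.
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random())
            sleep(delay)


# Usage: an update that is picked as the deadlock victim twice, then succeeds.
calls = []

def flaky_update():
    calls.append(1)
    if len(calls) < 3:
        raise DeadlockError("chosen as deadlock victim")
    return "updated"

result = run_with_retry(flaky_update, sleep=lambda seconds: None)
print(result, "after", len(calls), "attempts")
```

The point is only that the transaction picked as the victim can normally be run again; whether a particular failed coordinator action gets picked up again is a separate question.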
>>>>> Is the job failing, or is it taking more time?
>>>>> 
>>>>> What is the queue size? Are there a lot of jobs running?
>>>>> 
>>>>> Thanks,
>>>>> Mayank
>>>>> 
>>>>> 
>>>>> On Thu, Nov 10, 2011 at 3:25 PM, Max Hansmire <hansmire@gmail.com> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I am running into an issue with Oozie coordinators sometimes not
>>>>>> starting. It looks like some kind of deadlocking issue. It only happens
>>>>>> intermittently. Is this a known issue?
>>>>>> 
>>>>>> The coordinator has no input datasets. I have included the most
>>>>>> relevant line from the log. I am running Cloudera 3 Beta 4. Let me know
>>>>>> if you think you need more info.
>>>>>> 
>>>>>> Max
>>>>>> 
>>>>>> 2011-11-07 01:02:29,494 ERROR CoordActionMaterializeCommand:522 -
>>>>>> USER[oozie] GROUP[users] TOKEN[] APP[load_dim_traffic]
>>>>>> JOB[0000081-111101173146276-oozie-oozi-C] ACTION[-] XException,
>>>>>> org.apache.oozie.command.CommandException: E1001: Could not read the
>>>>>> coordinator job definition, E0607: Other error in operation [updateJob],
>>>>>> Unable to obtain an object lock on "A lock could not be obtained due to a
>>>>>> deadlock, cycle of locks and waiters is:
>>>>>> Lock : ROW, COORD_JOBS, (1,27)
>>>>>> 
>>>>>> 
>>>> 
>> 
>> 

