ambari-dev mailing list archives

From "Hadoop QA (JIRA)" <>
Subject [jira] [Commented] (AMBARI-12570) Cluster creates stuck at 9x% (deadlock sql exception)
Date Tue, 04 Aug 2015 20:52:04 GMT


Hadoop QA commented on AMBARI-12570:

{color:green}+1 overall{color}.  Here are the results of testing the latest attachment
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 5 new or modified test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results:
Console output:

This message is automatically generated.

> Cluster creates stuck at 9x% (deadlock sql exception)
> -----------------------------------------------------
>                 Key: AMBARI-12570
>                 URL:
>             Project: Ambari
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Blocker
>             Fix For: 2.1.1
>         Attachments: AMBARI-12570.patch, AMBARI-12570.patch.1
> Similar to AMBARI-12526, Ambari installation via a blueprint on SQL Azure gets stuck somewhere between 90% and 100% because of a SQL Database deadlock. 
> This is always between {{hostcomponentstate.current_state}} and {{hostcomponentstate.version}}.

> {code}
> Rollback reason: 
> Local Exception Stack: 
> Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.5.2.v20140319-9ad6abd):
> Internal Exception: Transaction (Process ID 62) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
> Error Code: 1205
> Call: UPDATE hostcomponentstate SET current_state = ? WHERE ((((component_name = ?) AND (host_id = ?)) AND (cluster_id = ?)) AND (service_name = ?))
> 	bind => [5 parameters bound]
> 	at org.eclipse.persistence.exceptions.DatabaseException.sqlException(
> 	at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.executeDirectNoSelect(
> 	at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.executeNoSelect(
> 	at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.basicExecuteCall(
> 	at org.eclipse.persistence.internal.databaseaccess.ParameterizedSQLBatchWritingMechanism.executeBatch(
> 	at org.eclipse.persistence.internal.databaseaccess.ParameterizedSQLBatchWritingMechanism.executeBatchedStatements(
> 	at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.writesCompleted(
> 	at org.eclipse.persistence.internal.sessions.AbstractSession.writesCompleted(
> 	at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.writesCompleted(
> 	at org.eclipse.persistence.internal.sessions.RepeatableWriteUnitOfWork.writeChanges(
> 	at org.eclipse.persistence.internal.jpa.EntityManagerImpl.flush(
> 	at org.eclipse.persistence.internal.jpa.QueryImpl.performPreQueryFlush(
> 	at org.eclipse.persistence.internal.jpa.QueryImpl.executeReadQuery(
> 	at org.eclipse.persistence.internal.jpa.QueryImpl.getSingleResult(
> 	at org.eclipse.persistence.internal.jpa.EJBQueryImpl.getSingleResult(
> 	at org.apache.ambari.server.orm.dao.DaoUtils.selectOne(
> 	at org.apache.ambari.server.orm.dao.StackDAO.find(
> 	at org.apache.ambari.server.orm.AmbariLocalSessionInterceptor.invoke(
> 	at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl.setStackVersion(
> 	at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl$ServiceComponentHostOpStartedTransition.transition(
> 	at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl$ServiceComponentHostOpStartedTransition.transition(
> 	at org.apache.ambari.server.state.fsm.StateMachineFactory$SingleInternalArc.doTransition(
> 	at org.apache.ambari.server.state.fsm.StateMachineFactory.doTransition(
> 	at org.apache.ambari.server.state.fsm.StateMachineFactory.access$300(
> 	at org.apache.ambari.server.state.fsm.StateMachineFactory$InternalStateMachine.doTransition(
> 	at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl.handleEvent(
> 	at org.apache.ambari.server.state.cluster.ClusterImpl.processServiceComponentHostEvents(
> 	at org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.invoke(
> 	at org.apache.ambari.server.actionmanager.ActionScheduler.doWork(
> 	at
> 	at
> {code}
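> The competing statements are presumably of this shape (a sketch: only the {{current_state}} UPDATE appears in the trace above; the {{version}} UPDATE is inferred from the description):
> {code:sql}
> -- Process 1 (from the trace above)
> UPDATE hostcomponentstate SET current_state = ?
>  WHERE ((((component_name = ?) AND (host_id = ?)) AND (cluster_id = ?)) AND (service_name = ?))
> 
> -- Process 2 (inferred from the description): updates the other column,
> -- on what should be a different row, yet deadlocks with process 1
> UPDATE hostcomponentstate SET version = ?
>  WHERE ((((component_name = ?) AND (host_id = ?)) AND (cluster_id = ?)) AND (service_name = ?))
> {code}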
> - We have dual X-locks on {{hostcomponentstate}}, with both transactions asking for U-locks when updating the same rows.
> - Both X-locks, from different transactions and different processes, are reported on the same row (which should be impossible); based on the XML execution plan, we can see that the concurrent UPDATE statements are actually executing on different rows, per their CLUSTERED INDEX predicates.
> - In Java, Ambari holds locks which prevent concurrent U- or X-locks on the same row.
> - This only happens on SQL Server.
> My best suspicion right now is that we have a key hash collision on this table. That would explain why two processes appear to hold the same lock even though they are on different rows.

> I was able to use a database dump that I took to compare hash values from {{hostcomponentstate}}:
> {code:sql}
> SELECT  %%lockres%% as lock_hash, cluster_id, host_id, service_name, component_name
>   FROM hostcomponentstate 
>   ORDER BY host_id, service_name, %%lockres%%
> {code}
> {code}
> lock_hash	cluster_id	host_id	service_name	component_name
> (0d4a8b0869f5)	2	1	HDFS	SECONDARY_NAMENODE
> (0d4a8b0869f5)	2	1	HDFS	HDFS_CLIENT
> (99fb0081b824)	2	1	MAPREDUCE2	HISTORYSERVER
> (7086998db3dc)	2	1	YARN	APP_TIMELINE_SERVER
> (7086998db3dc)	2	1	YARN	RESOURCEMANAGER
> ...
> {code}
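> The duplicates can also be counted directly by grouping on the same undocumented {{%%lockres%%}} pseudo-column (a sketch against the same dump):
> {code:sql}
> -- Count rows that share a truncated lock hash; any row_count > 1
> -- is a pair of rows SQL Server cannot tell apart when locking
> SELECT lock_hash, COUNT(*) AS row_count
>   FROM (SELECT %%lockres%% AS lock_hash FROM hostcomponentstate) t
>   GROUP BY lock_hash
>   HAVING COUNT(*) > 1
> {code}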
> SQL Server is producing lock hashes that collide! The issue seems to be that we are using a CLUSTERED INDEX on 4 columns, 3 of which (cluster, host, service) are frequently identical across rows; the only variable is the component name. When the hash gets truncated to 6 bytes, we get duplicates. 
> This aligns with my suspicion as to why this is only a SQL Server problem, since other databases don't lock like this. It also makes sense that this is the only table it happens on, since we are not using a surrogate PK here. I think we have a few options:
> - Add more columns to the CLUSTERED INDEX in the hope of producing more unique hashes. The problem is that the other columns are also mostly identical across rows.
> - Change the CLUSTERED INDEX to a NONCLUSTERED INDEX (since this is our main query criteria) and use a single, unique BIGINT PK as we do for many other tables. I'm just not sure how SQL Server locks a row when the CLUSTERED INDEX is not part of the predicate.
> - Remove the CLUSTERED INDEX entirely (performance would probably tank).
> - Try to partition this table differently so that the lock space is more unique.
> - Alter the existing CLUSTERED INDEX so that it disallows row- and page-level locks, forcing all X-locks to be table-level locks. In theory this would prevent the deadlock and would not require us to change any data, but it would introduce a bottleneck on the table for anything more than a single read.
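> The last option could be a one-statement change on the SQL Server side (a sketch; {{PK_hostcomponentstate}} is a hypothetical name for the clustered index):
> {code:sql}
> -- Disallow row- and page-level locks on the clustered index, forcing
> -- X-locks to be taken at table granularity (index name is assumed)
> ALTER INDEX PK_hostcomponentstate ON hostcomponentstate
>   SET (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF);
> {code}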

This message was sent by Atlassian JIRA
