Date: Wed, 5 Aug 2015 22:45:05 +0000 (UTC)
From: "Hudson (JIRA)"
To: dev@ambari.apache.org
Subject: [jira] [Commented] (AMBARI-12570) Cluster creates stuck at 9x% (deadlock sql exception)

    [ https://issues.apache.org/jira/browse/AMBARI-12570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659114#comment-14659114 ]

Hudson commented on AMBARI-12570:
---------------------------------

FAILURE: Integrated in Ambari-branch-2.1 #330 (See [https://builds.apache.org/job/Ambari-branch-2.1/330/])
AMBARI-12570 - Cluster creates stuck at 9x% (deadlock sql exception) (part3) (jonathanhurley) (jhurley: http://git-wip-us.apache.org/repos/asf?p=ambari.git&a=commit&h=376c49750e61088edd392ac262c7ea7d2ab4a354)
* ambari-server/src/test/java/org/apache/ambari/server/upgrade/UpgradeCatalog211Test.java


> Cluster creates stuck at 9x% (deadlock sql exception)
> -----------------------------------------------------
>
>                 Key: AMBARI-12570
>                 URL: https://issues.apache.org/jira/browse/AMBARI-12570
>             Project: Ambari
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Blocker
>             Fix For: 2.2.0
>
>         Attachments: AMBARI-12570.patch, AMBARI-12570.patch.1
>
>
> Similar to AMBARI-12526, Ambari installation via a blueprint on SQL Azure gets stuck somewhere between 90% and 100% because of a SQL Database deadlock.
> This is always between {{hostcomponentstate.current_state}} and {{hostcomponentstate.version}}.
> {code}
> Rollback reason:
> Local Exception Stack:
> Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.5.2.v20140319-9ad6abd): org.eclipse.persistence.exceptions.DatabaseException
> Internal Exception: com.microsoft.sqlserver.jdbc.SQLServerException: Transaction (Process ID 62) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
> Error Code: 1205
> Call: UPDATE hostcomponentstate SET current_state = ? WHERE ((((component_name = ?) AND (host_id = ?)) AND (cluster_id = ?)) AND (service_name = ?))
>   bind => [5 parameters bound]
>   at org.eclipse.persistence.exceptions.DatabaseException.sqlException(DatabaseException.java:331)
>   at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.executeDirectNoSelect(DatabaseAccessor.java:900)
>   at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.executeNoSelect(DatabaseAccessor.java:962)
>   at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.basicExecuteCall(DatabaseAccessor.java:631)
>   at org.eclipse.persistence.internal.databaseaccess.ParameterizedSQLBatchWritingMechanism.executeBatch(ParameterizedSQLBatchWritingMechanism.java:149)
>   at org.eclipse.persistence.internal.databaseaccess.ParameterizedSQLBatchWritingMechanism.executeBatchedStatements(ParameterizedSQLBatchWritingMechanism.java:134)
>   at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.writesCompleted(DatabaseAccessor.java:1836)
>   at org.eclipse.persistence.internal.sessions.AbstractSession.writesCompleted(AbstractSession.java:4244)
>   at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.writesCompleted(UnitOfWorkImpl.java:5594)
>   at org.eclipse.persistence.internal.sessions.RepeatableWriteUnitOfWork.writeChanges(RepeatableWriteUnitOfWork.java:453)
>   at org.eclipse.persistence.internal.jpa.EntityManagerImpl.flush(EntityManagerImpl.java:863)
>   at org.eclipse.persistence.internal.jpa.QueryImpl.performPreQueryFlush(QueryImpl.java:963)
>   at org.eclipse.persistence.internal.jpa.QueryImpl.executeReadQuery(QueryImpl.java:207)
>   at org.eclipse.persistence.internal.jpa.QueryImpl.getSingleResult(QueryImpl.java:517)
>   at org.eclipse.persistence.internal.jpa.EJBQueryImpl.getSingleResult(EJBQueryImpl.java:400)
>   at org.apache.ambari.server.orm.dao.DaoUtils.selectOne(DaoUtils.java:80)
>   at org.apache.ambari.server.orm.dao.StackDAO.find(StackDAO.java:93)
>   at org.apache.ambari.server.orm.AmbariLocalSessionInterceptor.invoke(AmbariLocalSessionInterceptor.java:53)
>   at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl.setStackVersion(ServiceComponentHostImpl.java:1058)
>   at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl$ServiceComponentHostOpStartedTransition.transition(ServiceComponentHostImpl.java:628)
>   at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl$ServiceComponentHostOpStartedTransition.transition(ServiceComponentHostImpl.java:610)
>   at org.apache.ambari.server.state.fsm.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:354)
>   at org.apache.ambari.server.state.fsm.StateMachineFactory.doTransition(StateMachineFactory.java:294)
>   at org.apache.ambari.server.state.fsm.StateMachineFactory.access$300(StateMachineFactory.java:39)
>   at org.apache.ambari.server.state.fsm.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:440)
>   at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl.handleEvent(ServiceComponentHostImpl.java:901)
>   at org.apache.ambari.server.state.cluster.ClusterImpl.processServiceComponentHostEvents(ClusterImpl.java:2508)
>   at org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.invoke(AmbariJpaLocalTxnInterceptor.java:68)
>   at org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:343)
>   at org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:195)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> - We have dual X-locks on {{hostcomponentstate}} asking for U-locks when updating the CLUSTERED INDEX.
> - Both X-locks, from different transactions and different processes, are reported on the same row, which should be impossible: the XML execution plan shows that the concurrent UPDATE statements execute against different rows because of their CLUSTERED INDEX predicates.
> - In Java, Ambari holds locks which prevent concurrent U- or X-locks on the same row.
> - This only happens on SQL Server.
> My best suspicion right now is that we have a key hash collision on this table. That would explain why two processes appear to hold the same lock even though they are working on different rows.
> I was able to use a database dump that I took to compare lock hash values from {{hostcomponentstate}}:
> {code:sql}
> SELECT %%lockres%% AS lock_hash, cluster_id, host_id, service_name, component_name
> FROM hostcomponentstate
> ORDER BY host_id, service_name, %%lockres%%
> {code}
> {code}
> lock_hash        cluster_id  host_id  service_name  component_name
> (0d4a8b0869f5)   2           1        HDFS          SECONDARY_NAMENODE
> (0d4a8b0869f5)   2           1        HDFS          HDFS_CLIENT
> (99fb0081b824)   2           1        MAPREDUCE2    HISTORYSERVER
> (7086998db3dc)   2           1        YARN          APP_TIMELINE_SERVER
> (7086998db3dc)   2           1        YARN          RESOURCEMANAGER
> (3bef52323322)   2           1        ZOOKEEPER     ZOOKEEPER_SERVER
> ...
> {code}
> SQL Server is producing lock hashes that collide! The issue seems to be that we are using a CLUSTERED INDEX on 4 columns, 3 of which (cluster, host, service) are very often identical; the only variable is the component name. When the hash is truncated to 6 bytes, we get duplicates.
> This aligns with my suspicion as to why this is only a SQL Server problem, since other databases don't lock like this. It also makes sense that this is the only table it happens on, since we are not using a surrogate PK here. I think we have a few options:
> - Add more columns to the CLUSTERED INDEX in the hope of getting a more unique hash. The problem is that the other columns are also mostly identical.
> - Change the CLUSTERED INDEX to a NONCLUSTERED INDEX (since this is our main query criteria) and use a single, unique BIGINT PK as we do for many other tables (see the sketch after this list). I'm just not sure how SQL Server locks a row when the CLUSTERED INDEX is not part of the predicate.
> - Remove the CLUSTERED INDEX entirely (performance would probably tank).
> - It may be possible to partition this table differently so that the lock space is more unique.
> - Change the existing CLUSTERED INDEX to disallow row- and page-level locks, forcing all X-locks to be table-level locks. In theory this would prevent the deadlock and would not require us to change any data, but it would introduce a bottleneck on the table for anything more than a single reader.
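
A minimal T-SQL sketch of the surrogate-key and lock-granularity options above, for illustration only: the object names ({{PK_hostcomponentstate}}, {{UQ_hcs_natural_key}}) are assumed rather than taken from Ambari's actual DDL, the sketch ignores any foreign keys that reference the existing primary key, and a real upgrade would likely assign the new {{id}} values from Ambari's own sequence mechanism instead of an IDENTITY column.

{code:sql}
-- Surrogate-key option: replace the composite clustered PK with a BIGINT key
-- and keep the natural key as a NONCLUSTERED unique constraint.
-- Column names come from the query above; constraint/index names are assumed.
ALTER TABLE hostcomponentstate DROP CONSTRAINT PK_hostcomponentstate;

-- IDENTITY is used only to keep the sketch self-contained; it gives existing
-- rows unique values so the new PRIMARY KEY can be created.
ALTER TABLE hostcomponentstate ADD id BIGINT IDENTITY(1,1) NOT NULL;

ALTER TABLE hostcomponentstate
    ADD CONSTRAINT PK_hostcomponentstate PRIMARY KEY CLUSTERED (id);

ALTER TABLE hostcomponentstate
    ADD CONSTRAINT UQ_hcs_natural_key UNIQUE NONCLUSTERED
        (component_name, service_name, host_id, cluster_id);

-- Independent alternative (do not combine with the above): keep the existing
-- clustered index but disallow row and page locks, so X-locks are taken at
-- table granularity and colliding 6-byte row-lock hashes can no longer
-- deadlock two writers. Index name is assumed.
ALTER INDEX PK_hostcomponentstate ON hostcomponentstate
    SET (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF);
{code}

The trade-off of the second alternative is the one noted in the last bullet: every UPDATE then serializes on a table-level lock, so concurrent writers to {{hostcomponentstate}} queue behind each other.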