Return-Path: X-Original-To: apmail-hive-issues-archive@minotaur.apache.org Delivered-To: apmail-hive-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id EB69A18719 for ; Fri, 5 Jun 2015 23:13:00 +0000 (UTC) Received: (qmail 30697 invoked by uid 500); 5 Jun 2015 23:13:00 -0000 Delivered-To: apmail-hive-issues-archive@hive.apache.org Received: (qmail 30570 invoked by uid 500); 5 Jun 2015 23:13:00 -0000 Mailing-List: contact issues-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list issues@hive.apache.org Received: (qmail 30399 invoked by uid 99); 5 Jun 2015 23:13:00 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 05 Jun 2015 23:13:00 +0000 Date: Fri, 5 Jun 2015 23:13:00 +0000 (UTC) From: "Richard Williams (JIRA)" To: issues@hive.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HIVE-10410) Apparent race condition in HiveServer2 causing intermittent query failures MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HIVE-10410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575389#comment-14575389 ] Richard Williams commented on HIVE-10410: ----------------------------------------- [~thejas] In the test case I use to reproduce this issue, I'm not sharing a session among multiple queries or setting hive.exec.parallel=true. I am, however, using the same session across multiple calls to the HiveServer2 Thrift API--in particular, I'm running many concurrent processes that each open a session, execute an asynchronous operation, poll HiveServer2 for the status of that operation until it completes, and then close their session. I think that what's happening is that the foreground thread is continuing to talk to the metastore using its Hive object while the pooled background thread is executing the asynchronous operation (and thus also using the same Hive object). Right, the issue you mentioned is what I was talking about in my earlier comment--the patch I uploaded relies on the UserGroupInformation.doAs call in the submitted background operation to ensure that the current user in the background thread is the same as the current user in the foreground thread. Thus, when the background thread calls Hive.get, the call to isCurrentUserOwner will return false if the existing MetaStoreClient is associated with the wrong user and a new connection, this time associated with the correct user, will be created. Presumably the code that does this is the safeguard you're referring to? {noformat} public static Hive get(HiveConf c, boolean needsRefresh) throws HiveException { Hive db = hiveDB.get(); if (db == null || needsRefresh || !db.isCurrentUserOwner()) { if (db != null) { LOG.debug("Creating new db. db = " + db + ", needsRefresh = " + needsRefresh + ", db.isCurrentUserOwner = " + db.isCurrentUserOwner()); } closeCurrent(); c.set("fs.scheme.class", "dfs"); Hive newdb = new Hive(c); hiveDB.set(newdb); return newdb; } {noformat} We haven't noticed any performance issues attributable to frequent reconnections to the metastore to as a result of this change. Then again, we probably wouldn't; we're using Sentry and thus have HiveServer2 impersonation disabled, so I would expect the current user to always be the user running the HiveServer2 process. When I get a chance, I'll try changing Hive.createMetaStoreClient to wrap the client it creates using HiveMetaStoreClient.newSynchronizedClient instead of removing the code that sets the background thread's Hive object. > Apparent race condition in HiveServer2 causing intermittent query failures > -------------------------------------------------------------------------- > > Key: HIVE-10410 > URL: https://issues.apache.org/jira/browse/HIVE-10410 > Project: Hive > Issue Type: Bug > Components: HiveServer2 > Affects Versions: 0.13.1 > Environment: CDH 5.3.3 > CentOS 6.4 > Reporter: Richard Williams > Attachments: HIVE-10410.1.patch > > > On our secure Hadoop cluster, queries submitted to HiveServer2 through JDBC occasionally trigger odd Thrift exceptions with messages such as "Read a negative frame size (-2147418110)!" or "out of sequence response" in HiveServer2's connections to the metastore. For certain metastore calls (for example, showDatabases), these Thrift exceptions are converted to MetaExceptions in HiveMetaStoreClient, which prevents RetryingMetaStoreClient from retrying these calls and thus causes the failure to bubble out to the JDBC client. > Note that as far as we can tell, this issue appears to only affect queries that are submitted with the runAsync flag on TExecuteStatementReq set to true (which, in practice, seems to mean all JDBC queries), and it appears to only manifest when HiveServer2 is using the new HTTP transport mechanism. When both these conditions hold, we are able to fairly reliably reproduce the issue by spawning about 100 simple, concurrent hive queries (we have been using "show databases"), two or three of which typically fail. However, when either of these conditions do not hold, we are no longer able to reproduce the issue. > Some example stack traces from the HiveServer2 logs: > {noformat} > 2015-04-16 13:54:55,486 ERROR hive.log: Got exception: org.apache.thrift.transport.TTransportException Read a negative frame size (-2147418110)! > org.apache.thrift.transport.TTransportException: Read a negative frame size (-2147418110)! > at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:435) > at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:414) > at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37) > at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) > at org.apache.hadoop.hive.thrift.TFilterTransport.readAll(TFilterTransport.java:62) > at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378) > at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297) > at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204) > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69) > at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_databases(ThriftHiveMetastore.java:600) > at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_databases(ThriftHiveMetastore.java:587) > at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabases(HiveMetaStoreClient.java:837) > at org.apache.sentry.binding.metastore.SentryHiveMetaStoreClient.getDatabases(SentryHiveMetaStoreClient.java:60) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:90) > at com.sun.proxy.$Proxy6.getDatabases(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getDatabasesByPattern(Hive.java:1139) > at org.apache.hadoop.hive.ql.exec.DDLTask.showDatabases(DDLTask.java:2445) > at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:364) > at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153) > at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85) > at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1554) > at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1321) > at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1139) > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:962) > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:957) > at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:145) > at org.apache.hive.service.cli.operation.SQLOperation.access$000(SQLOperation.java:69) > at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:200) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642) > at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:502) > at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:213) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The above exception being converted into a MetaException and re-thrown: > {noformat} > 2015-04-16 13:54:55,486 ERROR hive.ql.exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: org.apache.thrift.transport.TTransportException Read a negative frame size (-2147418110)!) > at org.apache.hadoop.hive.ql.metadata.Hive.getDatabasesByPattern(Hive.java:1141) > at org.apache.hadoop.hive.ql.exec.DDLTask.showDatabases(DDLTask.java:2445) > at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:364) > at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153) > at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85) > at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1554) > at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1321) > at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1139) > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:962) > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:957) > at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:145) > at org.apache.hive.service.cli.operation.SQLOperation.access$000(SQLOperation.java:69) > at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:200) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642) > at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:502) > at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:213) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: MetaException(message:Got exception: org.apache.thrift.transport.TTransportException Read a negative frame size (-2147418110)!) > at org.apache.hadoop.hive.metastore.MetaStoreUtils.logAndThrowMetaException(MetaStoreUtils.java:1116) > at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabases(HiveMetaStoreClient.java:839) > at org.apache.sentry.binding.metastore.SentryHiveMetaStoreClient.getDatabases(SentryHiveMetaStoreClient.java:60) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:90) > at com.sun.proxy.$Proxy6.getDatabases(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getDatabasesByPattern(Hive.java:1139) > ... 22 more > {noformat} > The above MetaException causing the query as a whole to fail: > {noformat} > 2015-04-16 13:54:55,486 ERROR org.apache.hive.service.cli.operation.Operation: Error running hive query: > org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: org.apache.thrift.transport.TTransportException Read a negative frame size (-2147418110)!) > at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:148) > at org.apache.hive.service.cli.operation.SQLOperation.access$000(SQLOperation.java:69) > at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:200) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642) > at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:502) > at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:213) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > An "out of sequence response" that occurred shortly after the above exception and may have been triggered by it: > {noformat} > 2015-04-16 13:54:55,498 ERROR hive.log: Got exception: org.apache.thrift.TApplicationException get_databases failed: out of sequence response > org.apache.thrift.TApplicationException: get_databases failed: out of sequence response > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:76) > at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_databases(ThriftHiveMetastore.java:600) > at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_databases(ThriftHiveMetastore.java:587) > at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabases(HiveMetaStoreClient.java:837) > at org.apache.sentry.binding.metastore.SentryHiveMetaStoreClient.getDatabases(SentryHiveMetaStoreClient.java:60) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:90) > at com.sun.proxy.$Proxy6.getDatabases(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getDatabasesByPattern(Hive.java:1139) > at org.apache.hadoop.hive.ql.exec.DDLTask.showDatabases(DDLTask.java:2445) > at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:364) > at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153) > at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85) > at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1554) > at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1321) > at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1139) > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:962) > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:957) > at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:145) > at org.apache.hive.service.cli.operation.SQLOperation.access$000(SQLOperation.java:69) > at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:200) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642) > at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:502) > at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:213) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)