Return-Path: X-Original-To: apmail-hive-dev-archive@www.apache.org Delivered-To: apmail-hive-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 0FB8810869 for ; Tue, 22 Oct 2013 07:47:53 +0000 (UTC) Received: (qmail 44621 invoked by uid 500); 22 Oct 2013 07:47:48 -0000 Delivered-To: apmail-hive-dev-archive@hive.apache.org Received: (qmail 44440 invoked by uid 500); 22 Oct 2013 07:47:47 -0000 Mailing-List: contact dev-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list dev@hive.apache.org Received: (qmail 44425 invoked by uid 500); 22 Oct 2013 07:47:45 -0000 Delivered-To: apmail-hadoop-hive-dev@hadoop.apache.org Received: (qmail 44413 invoked by uid 99); 22 Oct 2013 07:47:43 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 22 Oct 2013 07:47:43 +0000 Date: Tue, 22 Oct 2013 07:47:43 +0000 (UTC) From: "Navis (JIRA)" To: hive-dev@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HIVE-5172) TUGIContainingTransport returning null transport, causing intermittent SocketTimeoutException on hive client and NullPointerException in TUGIBasedProcessor on the server MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HIVE-5172?page=3Dcom.atlassian.= jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D13801= 584#comment-13801584 ]=20 Navis commented on HIVE-5172: ----------------------------- As mentioned above by Ashutosh, current patch is not complete fix for this = problem. So I've made separate issue and committed to trunk for a temporal = mending. > TUGIContainingTransport returning null transport, causing intermittent So= cketTimeoutException on hive client and NullPointerException in TUGIBasedPr= ocessor on the server > -------------------------------------------------------------------------= ---------------------------------------------------------------------------= --------------------- > > Key: HIVE-5172 > URL: https://issues.apache.org/jira/browse/HIVE-5172 > Project: Hive > Issue Type: Bug > Components: Metastore > Affects Versions: 0.9.0, 0.10.0, 0.11.0 > Reporter: agate > Attachments: HIVE-5172.1.patch.txt > > > We are running into frequent problem using HCatalog 0.4.1 (Hive Metastore= Server 0.9) where we get connection reset or connection timeout errors on = the client and NullPointerException in TUGIBasedProcessor on the server.=20 > {code} > hive client logs: > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutEx= ception: Read timed out > at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport= .java:129) > at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) > at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.jav= a:378) > at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.jav= a:297) > at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryPro= tocol.java:204) > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69) > at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_s= et_ugi(ThriftHiveMetastore.java:2136) > at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ug= i(ThriftHiveMetastore.java:2122) > at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMet= aStoreClient.java:286) > at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStor= eClient.java:197) > at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaSt= oreClient.java:157) > at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.jav= a:2092) > at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102) > at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888) > at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:830) > at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticA= nalyzer.java:954) > at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(Seman= ticAnalyzer.java:7524) > at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSeman= ticAnalyzer.java:243) > at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431) > at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336) > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909) > at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:25= 8) > at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215) > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406) > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341) > at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642) > at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.j= ava:39) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccess= orImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.util.RunJar.main(RunJar.java:156) > Caused by: java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.read(SocketInputStream.java:129) > at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport= .java:127) > ... 31 more > {code} > {code} > hive metastore server logs: > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D > 2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer (TThreadPoolServer= .java:run(182)) - Error occurred during processing of message. > java.lang.NullPointerException > at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddre= ss(TUGIBasedProcessor.java:183) > at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TU= GIBasedProcessor.java:79) > at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(T= ThreadPoolServer.java:176) > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecu= tor.java:886) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.= java:908) > at java.lang.Thread.run(Thread.java:662) > {code} > Adding some extra debug log messages in TUGIBasedProcessor, noticed that = the TUGIContainingTransport is null which results in NullPointerException o= n the server. > Further drilling into TUGIContainingTransport noticed that getTransport()= returns a null which causes the above=20 > error.=20 > Further corelating with GC logs observed that that error always hits when= the CMS GC has just kicked in=20 > (but does not happen after every GC) > Put some debugging code in TUGIContainingTransport.getTransport() and I t= racked it down to=20 > {code} > @Override > public TUGIContainingTransport getTransport(TTransport trans) { > =E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82// UGI information = is not available at connection setup time, it will be set later > =E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82// via set_ugi() rp= c. > =E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82transMap.putIfAbsen= t(trans, new TUGIContainingTransport(trans)); > =E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82//return transMap.g= et(trans); //<-change > TUGIContainingTransport retTrans =3D transMap.get(trans); > if ( retTrans =3D=3D null ) { > LOGGER.error ("cannot find transport that was in map !!") > =09 return retTrans; > } else { > LOGGER.debug ("found transport in map") > return retTrans; =20 > } > } > {code} > When we run this in our test environment, see that we run into the proble= m just after GC runs, > and "cannot find transport that was in the map!!" message gets logged. > Could the GC be collecting entries from transMap, just before the we get = it=20 > Tried a minor change which seems to work > {code} > public TUGIContainingTransport getTransport(TTransport trans) { > TUGIContainingTransport tugiTrans =3D transMap.get(trans); > if ( tugiTrans =3D=3D null ) { > =E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82// UGI information = is not available at connection setup time, it will be set later > =E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82// via set_ugi() rp= c. > tugiTrans =3D new TUGIContainingTransport(trans); //get a new TUGI= ContainingTransport > =E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82 =E2=80=82transMap.putIfAbs= ent(trans, tugiTrans); > } > return tugiTrans; > } > {code} > My questions for hive and thrift experts > 1.) Do we need to use a ConcurrentMap > {code} > ConcurrentMap transMap =3D new MapMa= ker().weakKeys().weakValues().makeMap(); > {code} > It does use =3D=3D to compare keys (which might be the problem), also in = this case we cant rely on the trans to be always there in the transMap, eve= n after a put, so in that case change above probably makes sense > 2.) Is it better idea to use WeakHashMap with WeakReference value instead= ? (was looking at org.apache.thrift.transport.TSaslServerTransport, especi= ally change made by THRIFT-1468 as an example) > e.g. > {code} > private static Map> tr= ansMap =3D Collections.synchronizedMap(new WeakHashMap>()); > {code} > getTransport() would be something like > {code} > public TUGIContainingTransport getTransport(TTransport trans) { > =E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82WeakReference ret =3D transMap.get(trans); > =E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82if (ret =3D=3D null= || ret.get() =3D=3D null) { > =E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82r= et =3D new WeakReference(new TUGIContainingTranspo= rt(trans));=E2=80=82=E2=80=82 > =E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82t= ransMap.put(trans, ret); // No need for putIfAbsent(). > =E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82 // Concu= rrent calls to getTransport() will pass in different TTransports. > =E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82} > =E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82return ret.get(); > } > {code} > 3.) Is it absolutely essential that successive calls to get getTransport(= ) need to retrieve the same TUGIContainingTransport object or its ok to cre= ate a new one, if its not available in the map ? -- This message was sent by Atlassian JIRA (v6.1#6144)