hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ashutosh Chauhan <hashut...@apache.org>
Subject Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors
Date Thu, 29 Aug 2013 18:53:27 GMT
Thanks Agatea for digging in. Seems like you have hit a bug. Would you mind
opening a jira and adding your findings to it.

Thanks,
Ashutosh


On Thu, Aug 29, 2013 at 11:22 AM, agateaaa <agateaaa@gmail.com> wrote:

> Sorry hit send too soon ...
>
> Hi All:
>
> Put some debugging code in TUGIContainingTransport.getTransport() and I
> tracked it down to
>
> @Override
> public TUGIContainingTransport getTransport(TTransport trans) {
>
> // UGI information is not available at connection setup time, it will be
> set later
> // via set_ugi() rpc.
> transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
>
> //return transMap.get(trans); //<-change
>           TUGIContainingTransport retTrans = transMap.get(trans);
>
>           if ( retTrans == null ) {
>              LOGGER.error (" cannot find transport that was in map !!")
>            }  else {
>              LOGGER.debug (" cannot find transport that was in map !!")
>              return retTrans;
>        }
> }
>
> When we run this in our test environment, see that we run into the problem
> just after GC runs,
> and "cannot find transport that was in the map!!" message gets logged.
>
> Could the GC be collecting entries from transMap, just before the we get it
>
> Tried a minor change which seems to work
>
> public TUGIContainingTransport getTransport(TTransport trans) {
>
>    TUGIContainingTransport retTrans = transMap.get(trans);
>
>     if ( retTrans == null ) {
> // UGI information is not available at connection setup time, it will be
> set later
> // via set_ugi() rpc.
> transMap.putIfAbsent(trans, retTrans);
>     }
>    return retTrans;
> }
>
>
> My questions for hive and  thrift experts
>
> 1.) Do we need to use a ConcurrentMap
> ConcurrentMap<TTransport, TUGIContainingTransport> transMap = new
> MapMaker().weakKeys().weakValues().makeMap();
> It does use == to compare keys (which might be the problem), also in this
> case we cant rely on the trans to be always there in the transMap, even
> after a put, so in that case change above
> probably makes sense
>
>
> 2.) Is it better idea to use WeakHashMap with WeakReference instead ? (was
> looking at org.apache.thrift.transport.TSaslServerTransport, esp change
> made by THRIFT-1468)
>
> e.g.
> private static Map<TTransport, WeakReference<TUGIContainingTransport>>
> transMap3 = Collections.synchronizedMap(new WeakHashMap<TTransport,
> WeakReference<TUGIContainingTransport>>());
>
> getTransport() would be something like
>
> public TUGIContainingTransport getTransport(TTransport trans) {
> WeakReference<TUGIContainingTransport> ret = transMap.get(trans);
> if (ret == null || ret.get() == null) {
> ret = new WeakReference<TUGIContainingTransport>(new
> TUGIContainingTransport(trans));
> transMap3.put(trans, ret); // No need for putIfAbsent().
> // Concurrent calls to getTransport() will pass in different TTransports.
> }
> return ret.get();
> }
>
>
> I did try 1.) above in our test environment and it does seem to resolve the
> problem, though i am not sure if I am introducing any other problem
>
>
> Can someone help ?
>
>
> Thanks
> Agatea
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Thu, Aug 29, 2013 at 10:57 AM, agateaaa <agateaaa@gmail.com> wrote:
>
> > Hi All:
> >
> > Put some debugging code in TUGIContainingTransport.getTransport() and I
> > tracked it down to
> >
> > @Override
> > public TUGIContainingTransport getTransport(TTransport trans) {
> >
> > // UGI information is not available at connection setup time, it will be
> > set later
> > // via set_ugi() rpc.
> > transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
> >
> > //return transMap.get(trans); <-change
> >           TUGIContainingTransport retTrans = transMap.get(trans);
> >
> >           if ( retTrans == null ) {
> >
> >
> >
> > }
> >
> >
> >
> >
> >
> > On Wed, Jul 31, 2013 at 9:48 AM, agateaaa <agateaaa@gmail.com> wrote:
> >
> >> Thanks Nitin
> >>
> >> There arent too many connections in close_wait state only 1 or two when
> >> we run into this. Most likely its because of dropped connection.
> >>
> >> I could not find any read or write timeouts we can set for the thrift
> >> server which will tell thrift to hold on to the client connection.
> >>  See this https://issues.apache.org/jira/browse/HIVE-2006 but doesnt
> >> seem to have been implemented yet. We do have set a client connection
> >> timeout but cannot find
> >> an equivalent setting for the server.
> >>
> >> We have  a suspicion that this happens when we run two client processes
> >> which modify two distinct partitions of the same hive table. We put in a
> >> workaround so that the two hive client processes never run together and
> so
> >> far things look ok but we will keep monitoring.
> >>
> >> Could it be because hive metastore server is not thread safe, would
> >> running two alter table statements on two distinct partitions of the
> same
> >> table using two client connections cause problems like these, where hive
> >> metastore server closes or drops a wrong client connection and leaves
> the
> >> other hanging?
> >>
> >> Agateaaa
> >>
> >>
> >>
> >>
> >> On Tue, Jul 30, 2013 at 12:49 AM, Nitin Pawar <nitinpawar432@gmail.com
> >wrote:
> >>
> >>> The mentioned flow is called when you have unsecure mode of thrift
> >>> metastore client-server connection. So one way to avoid this is have a
> >>> secure way.
> >>>
> >>> <code>
> >>> public boolean process(final TProtocol in, final TProtocol out)
> >>> throwsTException {
> >>> setIpAddress(in);
> >>> ...
> >>> ...
> >>> ...
> >>> @Override
> >>>      protected void setIpAddress(final TProtocol in) {
> >>>     TUGIContainingTransport ugiTrans =
> >>> (TUGIContainingTransport)in.getTransport();
> >>>                     Socket socket = ugiTrans.getSocket();
> >>>     if (socket != null) {
> >>>       setIpAddress(socket);
> >>>
> >>> </code>
> >>>
> >>>
> >>> From the above code snippet, it looks like the null pointer exception
> is
> >>> not handled if the getSocket returns null.
> >>>
> >>> can you check whats the ulimit setting on the server? If its set to
> >>> default
> >>> can you set it to unlimited and restart hcat server. (This is just a
> wild
> >>> guess).
> >>>
> >>> also the getSocket method suggests "If the underlying TTransport is an
> >>> instance of TSocket, it returns the Socket object which it contains.
> >>> Otherwise it returns null."
> >>>
> >>> so someone from thirft gurus need to tell us whats happening. I have no
> >>> knowledge of this depth
> >>>
> >>> may be Ashutosh or Thejas will be able to help on this.
> >>>
> >>>
> >>>
> >>>
> >>> From the netstat close_wait, it looks like the hive metastore server
> has
> >>> not closed the connection (do not know why yet), may be the hive dev
> guys
> >>> can help.Are there too many connections in close_wait state?
> >>>
> >>>
> >>>
> >>> On Tue, Jul 30, 2013 at 5:52 AM, agateaaa <agateaaa@gmail.com> wrote:
> >>>
> >>> > Looking at the hive metastore server logs see errors like these:
> >>> >
> >>> > 2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer
> >>> > (TThreadPoolServer.java:run(182)) - Error occurred during processing
> of
> >>> > message.
> >>> > java.lang.NullPointerException
> >>> >         at
> >>> >
> >>> >
> >>>
> org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
> >>> >         at
> >>> >
> >>> >
> >>>
> org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
> >>> >         at
> >>> >
> >>> >
> >>>
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
> >>> > at
> >>> >
> >>> >
> >>>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>> >  at
> >>> >
> >>> >
> >>>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>> > at java.lang.Thread.run(Thread.java:662)
> >>> >
> >>> > approx same time as we see timeout or connection reset errors.
> >>> >
> >>> > Dont know if this is the cause or the side affect of he connection
> >>> > timeout/connection reset errors. Does anybody have any pointers or
> >>> > suggestions ?
> >>> >
> >>> > Thanks
> >>> >
> >>> >
> >>> > On Mon, Jul 29, 2013 at 11:29 AM, agateaaa <agateaaa@gmail.com>
> wrote:
> >>> >
> >>> > > Thanks Nitin!
> >>> > >
> >>> > > We have simiar setup (identical hcatalog and hive server versions)
> >>> on a
> >>> > > another production environment and dont see any errors (its been
> >>> running
> >>> > ok
> >>> > > for a few months)
> >>> > >
> >>> > > Unfortunately we wont be able to move to hcat 0.5 and hive 0.11
or
> >>> hive
> >>> > > 0.10 soon.
> >>> > >
> >>> > > I did see that the last time we ran into this problem doing a
> >>> netstat-ntp
> >>> > > | grep ":10000" see that server was holding on to one socket
> >>> connection
> >>> > in
> >>> > > CLOSE_WAIT state for a long time
> >>> > >  (hive metastore server is running on port 10000). Dont know if
> thats
> >>> > > relevant here or not
> >>> > >
> >>> > > Can you suggest any hive configuration settings we can tweak or
> >>> > networking
> >>> > > tools/tips, we can use to narrow this down ?
> >>> > >
> >>> > > Thanks
> >>> > > Agateaaa
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Mon, Jul 29, 2013 at 11:02 AM, Nitin Pawar <
> >>> nitinpawar432@gmail.com
> >>> > >wrote:
> >>> > >
> >>> > >> Is there any chance you can do a update on test environment
with
> >>> > hcat-0.5
> >>> > >> and hive-0(11 or 10) and see if you can reproduce the issue?
> >>> > >>
> >>> > >> We used to see this error when there was load on hcat server
or
> some
> >>> > >> network issue connecting to the server(second one was rare
> >>> occurrence)
> >>> > >>
> >>> > >>
> >>> > >> On Mon, Jul 29, 2013 at 11:13 PM, agateaaa <agateaaa@gmail.com>
> >>> wrote:
> >>> > >>
> >>> > >>> Hi All:
> >>> > >>>
> >>> > >>> We are running into frequent problem using HCatalog 0.4.1
(HIve
> >>> > Metastore
> >>> > >>> Server 0.9) where we get connection reset or connection
timeout
> >>> errors.
> >>> > >>>
> >>> > >>> The hive metastore server has been allocated enough (12G)
memory.
> >>> > >>>
> >>> > >>> This is a critical problem for us and would appreciate
if anyone
> >>> has
> >>> > any
> >>> > >>> pointers.
> >>> > >>>
> >>> > >>> We did add a retry logic in our client, which seems to
help, but
> I
> >>> am
> >>> > >>> just
> >>> > >>> wondering how can we narrow down to the root cause
> >>> > >>> of this problem. Could this be a hiccup in networking
which
> causes
> >>> the
> >>> > >>> hive
> >>> > >>> server to get into a unresponsive state  ?
> >>> > >>>
> >>> > >>> Thanks
> >>> > >>>
> >>> > >>> Agateaaa
> >>> > >>>
> >>> > >>>
> >>> > >>> Example Connection reset error:
> >>> > >>> =======================
> >>> > >>>
> >>> > >>> org.apache.thrift.transport.TTransportException:
> >>> > >>> java.net.SocketException:
> >>> > >>> Connection reset
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
> >>> > >>>  at
> >>> org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
> >>> > >>>  at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
> >>> > >>>  at
> >>> > org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
> >>> > >>>  at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
> >>> > >>>  at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157)
> >>> > >>>  at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
> >>> > >>> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
> >>> > >>>  at
> org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableAddParts(DDLSemanticAnalyzer.java:1817)
> >>> > >>>  at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:297)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
> >>> > >>>  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
> >>> > >>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
> >>> > >>>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
> >>> > >>> at
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
> >>> > >>>  at
> >>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
> >>> > >>> at
> >>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
> >>> > >>>  at
> >>> > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
> >>> > >>> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
> >>> > >>>  at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
> >>> > >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
> >>> > >>>  at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>> > >>>  at java.lang.reflect.Method.invoke(Method.java:597)
> >>> > >>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >>> > >>> Caused by: java.net.SocketException: Connection reset
> >>> > >>> at java.net.SocketInputStream.read(SocketInputStream.java:168)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
> >>> > >>>  ... 30 more
> >>> > >>>
> >>> > >>>
> >>> > >>>
> >>> > >>>
> >>> > >>> Example Connection timeout error:
> >>> > >>> ==========================
> >>> > >>>
> >>> > >>> org.apache.thrift.transport.TTransportException:
> >>> > >>> java.net.SocketTimeoutException: Read timed out
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
> >>> > >>>  at
> >>> org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
> >>> > >>>  at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
> >>> > >>>  at
> >>> > org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
> >>> > >>>  at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
> >>> > >>>  at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157)
> >>> > >>>  at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
> >>> > >>> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
> >>> > >>>  at
> org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
> >>> > >>> at
> org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:830)
> >>> > >>>  at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:954)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7524)
> >>> > >>>  at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
> >>> > >>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
> >>> > >>>  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
> >>> > >>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
> >>> > >>>  at
> >>> > >>>
> >>> >
> >>>
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
> >>> > >>> at
> >>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
> >>> > >>>  at
> >>> > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
> >>> > >>> at
> >>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
> >>> > >>>  at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
> >>> > >>> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
> >>> > >>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>> > >>>  at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>> > >>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>> > >>>  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >>> > >>> Caused by: java.net.SocketTimeoutException: Read timed
out
> >>> > >>> at java.net.SocketInputStream.socketRead0(Native Method)
> >>> > >>>  at java.net.SocketInputStream.read(SocketInputStream.java:129)
> >>> > >>> at
> >>> > >>>
> >>> > >>>
> >>> >
> >>>
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
> >>> > >>>  ... 31 more
> >>> > >>>
> >>> > >>
> >>> > >>
> >>> > >>
> >>> > >> --
> >>> > >> Nitin Pawar
> >>> > >>
> >>> > >
> >>> > >
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Nitin Pawar
> >>>
> >>
> >>
> >
>

Mime
View raw message