hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "agate (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-5172) TUGIContainingTransport returning null transport, causing intermittent SocketTimeoutException on hive client and NullPointerException in TUGIBasedProcessor on the server
Date Thu, 29 Aug 2013 21:39:51 GMT

     [ https://issues.apache.org/jira/browse/HIVE-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

agate updated HIVE-5172:
------------------------

    Description: 
We are running into frequent problem using HCatalog 0.4.1 (Hive Metastore Server 0.9) where
we get connection reset or connection timeout errors on the client and NullPointerException
in TUGITransport on the server. 



hive client logs:
=================

org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed
out
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:830)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:954)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7524)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
... 31 more



hive metastore server logs:
===============================

2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer (TThreadPoolServer.java:run(182)) -
Error occurred during processing of message.
java.lang.NullPointerException
        at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
        at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


Adding some extra debug log messages in TUGIBasedProcessor, noticed that the TUGIContainingTransport
is null which results in NullPointerException on the server.

Further drilling into TUGIContainingTransport noticed that getTransport() returns a null which
causes the above 
error. 

Further corelating with GC logs observed that that error always hits when the CMS GC has just
kicked in 
(but does not happen after every GC)



Put some debugging code in TUGIContainingTransport.getTransport() and I tracked it down to


@Override
public TUGIContainingTransport getTransport(TTransport trans) {

      // UGI information is not available at connection setup time, it will be
set later
      // via set_ugi() rpc.
      transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));

      //return transMap.get(trans); //<-change

      TUGIContainingTransport retTrans = transMap.get(trans);

      if ( retTrans == null ) {
             LOGGER.error ("cannot find transport that was in map !!")
	     return retTrans;
       }  else {
             LOGGER.debug ("found transport in map")
             return retTrans;    
       }
}

When we run this in our test environment, see that we run into the problem just after GC runs,
and "cannot find transport that was in the map!!" message gets logged.

Could the GC be collecting entries from transMap, just before the we get it 

Tried a minor change which seems to work

public TUGIContainingTransport getTransport(TTransport trans) {

   TUGIContainingTransport retTrans = transMap.get(trans);

    if ( retTrans == null ) {
      // UGI information is not available at connection setup time, it will be
set later
      // via set_ugi() rpc.
      transMap.putIfAbsent(trans, retTrans);
    }
   return retTrans;
}


My questions for hive and  thrift experts

1.) Do we need to use a ConcurrentMap
ConcurrentMap<TTransport, TUGIContainingTransport> transMap = new MapMaker().weakKeys().weakValues().makeMap();
It does use == to compare keys (which might be the problem), also in this case we cant rely
on the trans to be always there in the transMap, even after a put, so in that case change
above probably makes sense

2.) Is it better idea to use WeakHashMap with WeakReference value instead ? (was looking at
org.apache.thrift.transport.TSaslServerTransport, especially change made by THRIFT-1468 as
an example)

e.g.

private static Map<TTransport, WeakReference<TUGIContainingTransport>> transMap
= Collections.synchronizedMap(new WeakHashMap<TTransport, WeakReference<TUGIContainingTransport>>());

getTransport() would be something like

public TUGIContainingTransport getTransport(TTransport trans) {
      WeakReference<TUGIContainingTransport> ret = transMap.get(trans);
      if (ret == null || ret.get() == null) {
        ret = new WeakReference<TUGIContainingTransport>(new TUGIContainingTransport(trans));  
        transMap.put(trans, ret); // No need for putIfAbsent().
         // Concurrent calls to getTransport() will pass in different TTransports.
      }
      return ret.get();
}

3.) Is it absolutely essential that successive calls to get getTransport() need to retrieve
the same TUGIContainingTransport object or its ok to create a new one, if its not available
in the map ?

  was:
We are running into frequent problem using HCatalog 0.4.1 (Hive Metastore Server 0.9) where
we get connection reset or connection timeout errors on the client and NullPointerException
in TUGITransport on the server. 



hive client logs:
=================

org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed
out
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:830)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:954)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7524)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
... 31 more



hive metastore server logs:
===============================

2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer (TThreadPoolServer.java:run(182)) -
Error occurred during processing of message.
java.lang.NullPointerException
        at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
        at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


Adding some extra debug log messages in TUGIBasedProcessor, noticed that the TUGIContainingTransport
is null which results in NullPointerException on the server.

Further drilling into TUGIContainingTransport noticed that getTransport() returns a null which
causes the above 
error. 

Further corelating with GC logs observed that that error always hits when the CMS GC has just
kicked in 
(but does not happen after every GC)



Put some debugging code in TUGIContainingTransport.getTransport() and I tracked it down to


@Override
public TUGIContainingTransport getTransport(TTransport trans) {

      // UGI information is not available at connection setup time, it will be
set later
      // via set_ugi() rpc.
      transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));

      //return transMap.get(trans); //<-change

      TUGIContainingTransport retTrans = transMap.get(trans);

      if ( retTrans == null ) {
             LOGGER.error ("cannot find transport that was in map !!")
	     return retTrans;
       }  else {
             LOGGER.debug ("found transport in map")
             return retTrans;    
       }
}

When we run this in our test environment, see that we run into the problem just after GC runs,
and "cannot find transport that was in the map!!" message gets logged.

Could the GC be collecting entries from transMap, just before the we get it 

Tried a minor change which seems to work

public TUGIContainingTransport getTransport(TTransport trans) {

   TUGIContainingTransport retTrans = transMap.get(trans);

    if ( retTrans == null ) {
      // UGI information is not available at connection setup time, it will be
set later
      // via set_ugi() rpc.
      transMap.putIfAbsent(trans, retTrans);
    }
   return retTrans;
}


My questions for hive and  thrift experts

1.) Do we need to use a ConcurrentMap
ConcurrentMap<TTransport, TUGIContainingTransport> transMap = new MapMaker().weakKeys().weakValues().makeMap();
It does use == to compare keys (which might be the problem), also in this case we cant rely
on the trans to be always there in the transMap, even after a put, so in that case change
above probably makes sense

2.) Is it better idea to use WeakHashMap with WeakReference value instead ? (was looking at
org.apache.thrift.transport.TSaslServerTransport, especially change made by THRIFT-1468 as
an example)

e.g.

private static Map<TTransport, WeakReference<TUGIContainingTransport>> transMap
= Collections.synchronizedMap(new WeakHashMap<TTransport, WeakReference<TUGIContainingTransport>>());

getTransport() would be something like

public TUGIContainingTransport getTransport(TTransport trans) {
      WeakReference<TUGIContainingTransport> ret = transMap.get(trans);
      if (ret == null || ret.get() == null) {
        ret = new WeakReference<TUGIContainingTransport>(new TUGIContainingTransport(trans));  
        transMap.put(trans, ret); // No need for putIfAbsent().
                                   //
Concurrent calls to getTransport() will pass in different TTransports.
      }
      return ret.get();
}

3.) Is it absolutely essential that successive calls to get getTransport() need to retrieve
the same TUGIContainingTransport object or its ok to create a new one, if its not available
in the map ?

    
> TUGIContainingTransport returning null transport, causing intermittent SocketTimeoutException
on hive client and NullPointerException in TUGIBasedProcessor on the server
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5172
>                 URL: https://issues.apache.org/jira/browse/HIVE-5172
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>    Affects Versions: 0.9.0
>            Reporter: agate
>
> We are running into frequent problem using HCatalog 0.4.1 (Hive Metastore Server 0.9)
where we get connection reset or connection timeout errors on the client and NullPointerException
in TUGITransport on the server. 
> hive client logs:
> =================
> org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read
timed out
> at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
> at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
> at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
> at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
> at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
> at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
> at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157)
> at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
> at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
> at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:830)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:954)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7524)
> at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> Caused by: java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
> ... 31 more
> hive metastore server logs:
> ===============================
> 2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer (TThreadPoolServer.java:run(182))
- Error occurred during processing of message.
> java.lang.NullPointerException
>         at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
>         at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
>         at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Adding some extra debug log messages in TUGIBasedProcessor, noticed that the TUGIContainingTransport
is null which results in NullPointerException on the server.
> Further drilling into TUGIContainingTransport noticed that getTransport() returns a null
which causes the above 
> error. 
> Further corelating with GC logs observed that that error always hits when the CMS GC
has just kicked in 
> (but does not happen after every GC)
> Put some debugging code in TUGIContainingTransport.getTransport() and I tracked it down
to 
> @Override
> public TUGIContainingTransport getTransport(TTransport trans) {
>       // UGI information is not available at connection setup time, it will
be set later
>       // via set_ugi() rpc.
>       transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
>       //return transMap.get(trans); //<-change
>       TUGIContainingTransport retTrans = transMap.get(trans);
>       if ( retTrans == null ) {
>              LOGGER.error ("cannot find transport that was in map !!")
> 	     return retTrans;
>        }  else {
>              LOGGER.debug ("found transport in map")
>              return retTrans;    
>        }
> }
> When we run this in our test environment, see that we run into the problem just after
GC runs,
> and "cannot find transport that was in the map!!" message gets logged.
> Could the GC be collecting entries from transMap, just before the we get it 
> Tried a minor change which seems to work
> public TUGIContainingTransport getTransport(TTransport trans) {
>    TUGIContainingTransport retTrans = transMap.get(trans);
>     if ( retTrans == null ) {
>       // UGI information is not available at connection setup time, it will
be set later
>       // via set_ugi() rpc.
>       transMap.putIfAbsent(trans, retTrans);
>     }
>    return retTrans;
> }
> My questions for hive and  thrift experts
> 1.) Do we need to use a ConcurrentMap
> ConcurrentMap<TTransport, TUGIContainingTransport> transMap = new MapMaker().weakKeys().weakValues().makeMap();
> It does use == to compare keys (which might be the problem), also in this case we cant
rely on the trans to be always there in the transMap, even after a put, so in that case change
above probably makes sense
> 2.) Is it better idea to use WeakHashMap with WeakReference value instead ? (was looking
at org.apache.thrift.transport.TSaslServerTransport, especially change made by THRIFT-1468
as an example)
> e.g.
> private static Map<TTransport, WeakReference<TUGIContainingTransport>> transMap
= Collections.synchronizedMap(new WeakHashMap<TTransport, WeakReference<TUGIContainingTransport>>());
> getTransport() would be something like
> public TUGIContainingTransport getTransport(TTransport trans) {
>       WeakReference<TUGIContainingTransport> ret = transMap.get(trans);
>       if (ret == null || ret.get() == null) {
>         ret = new WeakReference<TUGIContainingTransport>(new TUGIContainingTransport(trans));  
>         transMap.put(trans, ret); // No need for putIfAbsent().
>          // Concurrent calls to getTransport() will pass in different TTransports.
>       }
>       return ret.get();
> }
> 3.) Is it absolutely essential that successive calls to get getTransport() need to retrieve
the same TUGIContainingTransport object or its ok to create a new one, if its not available
in the map ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message