hbase-user mailing list archives

From Eran Kutner <e...@gigya.com>
Subject Re: Lease does not exist exceptions
Date Tue, 18 Oct 2011 13:39:24 GMT
Hi Stack,
Yep, reducing the number of map tasks did resolve the problem; however, the
only way I found to do it is by changing the setting in the mapred-site.xml
file, which means it will affect all my jobs. Do you know if there is a way
to limit the number of concurrent map tasks a specific job may run? I know it
was possible with the old JobConf class from the mapred namespace, but the
new Job class doesn't have the setNumMapTasks() method.
Is it possible to extend the lease timeout? I'm not even sure what the lease
is on - HDFS blocks? What is it by default?

As for setBatch, what would be a good value? I didn't set it before and
setting it didn't seem to change anything.
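For what it's worth, the "prefetch size" mentioned below is usually controlled
by Scan.setCaching() (rows returned per next() RPC), while setBatch() caps the
number of columns per Result and mostly matters for very wide rows - which may
be why changing it had no visible effect here. A minimal sketch of wiring both
into a table map job, assuming the 0.90 mapreduce API; the table name and
mapper are placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.mapreduce.Job;

  public class ScanJobSketch {

    // Trivial mapper, only here to keep the sketch self-contained.
    static class NoopMapper extends TableMapper<NullWritable, NullWritable> {
      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context ctx) {
        // real per-row processing would go here
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "scan-job-sketch");
      job.setJarByClass(ScanJobSketch.class);

      Scan scan = new Scan();
      scan.setCaching(100);       // rows fetched per next() RPC ("prefetch size")
      scan.setBatch(1000);        // columns per Result; only matters for wide rows
      scan.setCacheBlocks(false); // generally recommended for full-table MR scans

      // "my_table" is a placeholder table name.
      TableMapReduceUtil.initTableMapperJob("my_table", scan, NoopMapper.class,
          NullWritable.class, NullWritable.class, job);
      job.setNumReduceTasks(0);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Smaller caching values mean more RPCs but shorter gaps between lease renewals;
the right number depends on row size and on how much work is done per row.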

Finally, to answer your question regarding the intensity of the job - yes, it
is pretty intense, pushing CPU and disk I/O utilization to ~90%.

Thanks a million!

-eran



On Tue, Oct 18, 2011 at 13:06, Stack <stack@duboce.net> wrote:

> Look back in the mailing list, Eran, for more detailed answers, but in
> essence, the below usually means that the client has been away from the
> server too long.  This can happen for a few reasons.  If you fetch lots of
> rows per next on a scanner, processing the batch client side may be taking
> you longer than the lease timeout.  Set down the prefetch size and see if
> that helps (I'm talking about this:
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int)
> ).
>  Throw in a GC on the client side or over on the server side and it might
> put you over your lease timeout.  Are your mapreduce jobs heavy-duty,
> robbing resources from the running regionservers or datanodes?  Try having
> them run half the mappers and see if that makes it more likely your job
> will complete.
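A minimal, self-contained sketch of the mechanism described above, assuming the
0.90 client API ("my_table" and process() are placeholders, not anything from
this thread): the scanner lease is only renewed when the client comes back for
another next() RPC, so long client-side work over a large cached batch, or a GC
pause, can let it expire.

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;

  public class ScannerLeaseSketch {

    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "my_table");   // placeholder table name

      Scan scan = new Scan();
      // Each next() RPC returns this many rows; the scanner lease on the
      // regionserver is only renewed when that RPC arrives.
      scan.setCaching(1000);

      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result row : scanner) {
          // If the per-row work here (plus GC pauses on either side) keeps the
          // client away longer than hbase.regionserver.lease.period while it
          // chews through a cached batch, the server expires the lease and a
          // later next() call fails with the LeaseException shown below.
          process(row);
        }
      } finally {
        scanner.close();
        table.close();
      }
    }

    private static void process(Result row) {
      // placeholder for real per-row processing
    }
  }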
>
> St.Ack
> P.S IIRC, J-D tripped over a cause recently but I can't find it at the mo.
>
> On Tue, Oct 18, 2011 at 10:28 AM, Eran Kutner <eran@gigya.com> wrote:
> > Hi,
> > I'm having a problem when running map/reduce on a table with about 500
> > regions.
> > The MR job shows this kind of exception:
> > 11/10/18 06:03:39 INFO mapred.JobClient: Task Id : attempt_201110030100_0086_m_000062_0, Status : FAILED
> > org.apache.hadoop.hbase.regionserver.LeaseException: org.apache.hadoop.hbase.regionserver.LeaseException: lease '-334679770697295011' does not exist
> >        at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1845)
> >        at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >
> >        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> >        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >        at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:96)
> >        at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:83)
> >        at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:1)
> >        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1019)
> >        at org.apache.hadoop.hbase.client.HTable$ClientScanner.next(HTable.java:1151)
> >        at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:149)
> >        at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:142)
> >        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
> >        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> >        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
> >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
> >        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
> >        at java.security.AccessController.doPrivileged(Native Method)
> >        at javax.security.auth.Subject.doAs(Subject.java:396)
> >        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
> >        at org.apache.hadoop.mapred.Child.main(Child.java:264)
> >
> > the hbase logs are full of these:
> > 2011-10-18 06:07:01,425 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
> > org.apache.hadoop.hbase.regionserver.LeaseException: lease '3475143032285946374' does not exist
> >        at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1845)
> >        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
> >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >
> >
> > and the datanode logs have a few of these (seemingly a lot fewer than the
> > hbase errors):
> > 2011-10-18 06:16:42,550 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.104.4:50010, storageID=DS-15546166-10.1.104.4-50010-1298985607414, infoPort=50075, ipcPort=50020):DataXceiver
> > java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.104.4:50010 remote=/10.1.104.1:57232]
> >        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
> >        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
> >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:214)
> >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:114)
> >
> > I've increased all the relevant limits I know of (which were high to begin
> > with), so now I have 64K file descriptors and dfs.datanode.max.xcievers is
> > 8192.
> > I've restarted everything in the cluster, to make sure all the processes
> > picked up the new configurations, but I still get those errors. They always
> > begin when the map phase is around 12-14%, and eventually the job fails at
> > ~50%.
> > Running random scans against the same hbase table while the job is running
> > seems to work fine.
> >
> > I'm using hadoop 0.20.2+923.97-1 from CDH3 and hbase 0.90.4 compiled from
> > the branch code a while ago.
> >
> > Any other settings I'm missing, or other ideas about what could be causing it?
> >
> > Thanks.
> >
> > -eran
> >
>
