hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: OutOfOrderScannerNextException consistently killing mapreduce jobs
Date Thu, 25 Jun 2015 17:19:27 GMT
Have you read this thread http://search-hadoop.com/m/YGbb1sOLh2W9Z9z ?

Cheers

On Thu, Jun 25, 2015 at 10:10 AM, Mateusz Kaczynski <mateusz@arachnys.com>
wrote:

> One of our clusters running HBase 0.98.6-cdh5.3.0 used to work (relatively)
> smoothly until a couple of days ago, when all of a sudden jobs started
> grinding to a halt and getting killed after reporting a massive number of
> errors of the form:
>
> org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of
> OutOfOrderScannerNextException: was there a rpc timeout?
>   at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:410)
>   at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:230)
>   at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
>   at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
>   at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
>   at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
>   at org.apache.hadoop.mapred.Child.main(Child.java:262)
>
> HBase regionservers contain a bunch of:
> WARN  [B.defaultRpcServer.handler=16,queue=1,port=60020] ipc.RpcServer:
> B.defaultRpcServer.handler=16,queue=1,port=60020: caught a
> ClosedChannelException, this means that the server was processing a request
> but the client went away. The error message was: null
>
> and:
> INFO  [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner
> 1086 lease expired on region
> table,8bf8fc3cd0e842c00fb4e556bbbdcd0f,1420155383100.19f5ed7c735d33b2cf8997e0b373a1a7
>
> In addition, there are reports of compactions (not sure if relevant at all):
> regionserver.HStore: Completed major compaction of 3 file(s) in cf of
> table,fc0caf49fa871a61702fa3781e160101,1420728621152.9ccc317ca180cabde13864d4600c8693.
> into efd8bec4dbf54ccca5f1351bfe9890c3(size=5.9 G), total size for store is
> 5.9 G. This selection was in queue for 0sec, and took 1mins, 57sec to
> execute.
>
> I've adjusted the following, thinking it might be a scanner cache size issue
> (we're dealing with docs of circa 100 KB):
> hbase.rpc.timeout - 900000
> hbase.regionserver.lease.period - 450000
> hbase.client.scanner.timeout.period - 450000
> hbase.client.scanner.caching - (down to) 50
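[Editor's note: for readers trying the same overrides, the list above corresponds to hbase-site.xml entries like the following sketch. In the 0.96+ line, hbase.client.scanner.timeout.period supersedes the older hbase.regionserver.lease.period, so setting both as the poster did is belt-and-braces rather than strictly required.]

```xml
<!-- Sketch of the hbase-site.xml overrides tried above (client and server side). -->
<property>
  <name>hbase.rpc.timeout</name>
  <value>900000</value> <!-- 15 min per RPC -->
</property>
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>450000</value> <!-- 7.5 min scanner lease; replaces hbase.regionserver.lease.period -->
</property>
<property>
  <name>hbase.client.scanner.caching</name>
  <value>50</value> <!-- small batches, since rows are ~100 KB each -->
</property>
```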
>
> To no avail. So I stripped the HBase config in hbase-site.xml down to the bare
> minimum, but I can still reproduce it with striking regularity. The minimal
> job reads from a table (c. 3500 regions, 17 nodes), uses NullOutputFormat and
> doesn't write anything, and the mapper's map function does nothing.
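[Editor's note: the minimal repro described above can be sketched as below, using the 0.98-era HBase MapReduce API. Class and table names are hypothetical stand-ins; this is an illustration of the described setup, not the poster's actual code.]

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanOnlyJob {

  // Mapper deliberately does nothing, so any failure is in the scan itself.
  static class NoopMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context) {
      // intentionally empty
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-only-repro");
    job.setJarByClass(ScanOnlyJob.class);

    Scan scan = new Scan();
    scan.setCaching(50);        // small batches for ~100 KB rows
    scan.setCacheBlocks(false); // don't pollute the block cache from MR scans

    TableMapReduceUtil.initTableMapperJob(
        "table",          // table name as in the post
        scan,
        NoopMapper.class,
        null, null,       // no map output key/value classes needed
        job);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job this shape isolates the scan path: if it still dies with OutOfOrderScannerNextException, the mapper logic can be ruled out.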
>
> It starts pretty fast, getting through 1.75% of the specified scan in ~1
> minute. Then it hits 2.5% in ~2m, 3% in ~3m. Then, around 4m20s, a massive
> wave of the aforementioned OutOfOrderScannerNextExceptions starts pouring in,
> slowing the job down until it fails ~1h later.
>
> I checked memory and disk usage on the individual nodes - all good; the open
> file limits are set relatively high, so we're clearly not hitting those.
>
> I'm running out of sanity and was wondering if anyone might have any ideas?
>
>
> --
> *Mateusz Kaczynski*
>
