hbase-user mailing list archives

From: Vladimir Rodionov <vladrodio...@gmail.com>
Subject: Re: OutOfOrderScannerNextException consistently killing mapreduce jobs
Date: Thu, 25 Jun 2015 17:25:21 GMT
Mateusz,

- How many regions do you have in your table?
- What is the cluster size?
- What is the scan spec in your M/R job (time range, filters)?
- RS node spec (CPUs, RAM, disks)
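
If you are not sure about the region count, HBaseAdmin can tell you. An untested sketch (adjust the table name to yours):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class RegionCount {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    // getTableRegions() returns one HRegionInfo per region of the table
    System.out.println(admin.getTableRegions(TableName.valueOf("yourtable")).size());
    admin.close();
  }
}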

Ted's link is a good starting point.

-Vlad

On Thu, Jun 25, 2015 at 10:19 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> Have you read this thread http://search-hadoop.com/m/YGbb1sOLh2W9Z9z ?
>
> Cheers
>
> On Thu, Jun 25, 2015 at 10:10 AM, Mateusz Kaczynski <mateusz@arachnys.com>
> wrote:
>
> > One of our clusters running HBase 0.98.6-cdh5.3.0 used to work (relatively)
> > smoothly until a couple of days ago, when all of a sudden jobs started
> > grinding to a halt and getting killed after reporting a massive number of
> > errors of the form:
> >
> > org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
> >   at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:410)
> >   at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:230)
> >   at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
> >   at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
> >   at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
> >   at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
> >   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
> >   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
> >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
> >   at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> >   at java.security.AccessController.doPrivileged(Native Method)
> >   at javax.security.auth.Subject.doAs(Subject.java:415)
> >   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
> >   at org.apache.hadoop.mapred.Child.main(Child.java:262)
> >
> > HBase regionservers contain a bunch of:
> > WARN  [B.defaultRpcServer.handler=16,queue=1,port=60020] ipc.RpcServer: B.defaultRpcServer.handler=16,queue=1,port=60020: caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null
> >
> > and:
> > INFO  [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner 1086 lease expired on region table,8bf8fc3cd0e842c00fb4e556bbbdcd0f,1420155383100.19f5ed7c735d33b2cf8997e0b373a1a7
> >
> > In addition, there are reports of compactions (not sure if relevant at all):
> > regionserver.HStore: Completed major compaction of 3 file(s) in cf of table,fc0caf49fa871a61702fa3781e160101,1420728621152.9ccc317ca180cabde13864d4600c8693. into efd8bec4dbf54ccca5f1351bfe9890c3(size=5.9 G), total size for store is 5.9 G. This selection was in queue for 0sec, and took 1mins, 57sec to execute.
> >
> > I've adjusted the following, thinking it might be a scanner cache size
> > issue (we're dealing with docs of circa 100 KB):
> > hbase.rpc.timeout - 900000
> > hbase.regionserver.lease.period - 450000
> > hbase.client.scanner.timeout.period - 450000
> > hbase.client.scanner.caching - (down to) 50
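> >
> > For reference, this is roughly how I apply the client-side values (a
> > trimmed sketch; the lease period is enforced by the regionservers, so
> > that one goes into their hbase-site.xml):
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.hbase.HBaseConfiguration;
> >
> > public class ScanTimeouts {
> >   public static Configuration create() {
> >     Configuration conf = HBaseConfiguration.create();
> >     // values as listed above; the timeouts are in milliseconds
> >     conf.setInt("hbase.rpc.timeout", 900000);
> >     conf.setInt("hbase.client.scanner.timeout.period", 450000);
> >     conf.setInt("hbase.client.scanner.caching", 50);
> >     return conf;
> >   }
> > }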
> >
> > To no avail. So I stripped the HBase config in hbase-site.xml down to the
> > bare minimum, but I can still reproduce it with striking accuracy. The
> > minimal job reads from a table (c. 3500 regions, 17 nodes), uses
> > NullOutputFormat so nothing is written back, and the mapper's map function
> > does nothing - essentially the sketch below.
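> >
> > (Trimmed sketch - table and class names changed:)
> >
> > import org.apache.hadoop.hbase.HBaseConfiguration;
> > import org.apache.hadoop.hbase.client.Result;
> > import org.apache.hadoop.hbase.client.Scan;
> > import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> > import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
> > import org.apache.hadoop.hbase.mapreduce.TableMapper;
> > import org.apache.hadoop.mapreduce.Job;
> > import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
> >
> > public class NoOpScanJob {
> >   static class NoOpMapper extends TableMapper<ImmutableBytesWritable, Result> {
> >     @Override
> >     protected void map(ImmutableBytesWritable key, Result value, Context ctx) {
> >       // intentionally empty - we only exercise the scan
> >     }
> >   }
> >
> >   public static void main(String[] args) throws Exception {
> >     Job job = Job.getInstance(HBaseConfiguration.create(), "noop-scan");
> >     job.setJarByClass(NoOpScanJob.class);
> >     Scan scan = new Scan();
> >     scan.setCaching(50);         // matches hbase.client.scanner.caching above
> >     scan.setCacheBlocks(false);  // recommended for full-table MR scans
> >     TableMapReduceUtil.initTableMapperJob("table", scan, NoOpMapper.class,
> >         ImmutableBytesWritable.class, Result.class, job);
> >     job.setOutputFormatClass(NullOutputFormat.class);
> >     job.setNumReduceTasks(0);
> >     System.exit(job.waitForCompletion(true) ? 0 : 1);
> >   }
> > }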
> >
> > It starts pretty fast, getting through 1.75% of the specified scan in ~1
> > minute. Then it hits 2.5% in ~2m, 3% in ~3m. Then, around 4m20s, a massive
> > wave of the aforementioned OutOfOrderScannerNextExceptions starts pouring
> > in, slowing the job down until it fails ~1h later.
> >
> > I checked memory and disk usage on the individual nodes - all good, and
> > open file limits are set relatively high, so we're clearly not hitting
> > those.
> >
> > I'm running out of sanity and was wondering if anyone might have any
> > ideas?
> >
> >
> > --
> > *Mateusz Kaczynski*
> >
>
