From: Dejan Menges
Date: Thu, 25 Jun 2015 18:13:44 +0000
Subject: Re: OutOfOrderScannerNextException consistently killing mapreduce jobs
To: user@hbase.apache.org

We recently had the same issue, but in our case it was hotspotting on regions
that had grown exceptionally large. After splitting some of those regions and
fixing the hotspotting, the problem disappeared.
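For reference, a minimal sketch of forcing such a split from client code with
the 0.98 HBaseAdmin API (the HBase shell's "split" command does the same thing).
The table name and split key below are hypothetical placeholders, not values
taken from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SplitOversizedRegion {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // "mytable" and "some-mid-key" are placeholders: this asks the region of
      // "mytable" that contains "some-mid-key" to split at that key. Picking a
      // key near the middle of the hot region's range keeps the two daughters
      // roughly balanced by key range.
      admin.split("mytable", "some-mid-key");
    } finally {
      admin.close();
    }
  }
}

The split request is asynchronous, so the call returns before the daughter
regions are online.
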
On Thu, Jun 25, 2015 at 7:25 PM Vladimir Rodionov wrote:

> Mateusz,
>
> - How many regions do you have in your table?
> - What is the cluster size?
> - What is the scan spec in your M/R job (time range, filters)?
> - RS node spec (CPUs, RAM, disks)?
>
> Ted's link is a good starting point.
>
> -Vlad
>
> On Thu, Jun 25, 2015 at 10:19 AM, Ted Yu wrote:
>
> > Have you read this thread http://search-hadoop.com/m/YGbb1sOLh2W9Z9z ?
> >
> > Cheers
> >
> > On Thu, Jun 25, 2015 at 10:10 AM, Mateusz Kaczynski <mateusz@arachnys.com> wrote:
> >
> > > One of our clusters running HBase 0.98.6-cdh5.3.0 used to run (relatively)
> > > smoothly until a couple of days ago, when all of a sudden jobs started
> > > grinding to a halt and getting killed after reporting a massive number of
> > > errors of the form:
> > >
> > > org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of
> > > OutOfOrderScannerNextException: was there a rpc timeout?
> > >   at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:410)
> > >   at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:230)
> > >   at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
> > >   at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
> > >   at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
> > >   at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
> > >   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
> > >   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
> > >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
> > >   at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> > >   at java.security.AccessController.doPrivileged(Native Method)
> > >   at javax.security.auth.Subject.doAs(Subject.java:415)
> > >   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
> > >   at org.apache.hadoop.mapred.Child.main(Child.java:262)
> > >
> > > The HBase regionserver logs contain a bunch of:
> > >
> > > WARN [B.defaultRpcServer.handler=16,queue=1,port=60020] ipc.RpcServer:
> > > B.defaultRpcServer.handler=16,queue=1,port=60020: caught a
> > > ClosedChannelException, this means that the server was processing a request
> > > but the client went away. The error message was: null
> > >
> > > and:
> > >
> > > INFO [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner
> > > 1086 lease expired on region
> > > table,8bf8fc3cd0e842c00fb4e556bbbdcd0f,1420155383100.19f5ed7c735d33b2cf8997e0b373a1a7
> > >
> > > In addition, there are reports of compactions (not sure if relevant at all):
> > >
> > > regionserver.HStore: Completed major compaction of 3 file(s) in cf of
> > > table,fc0caf49fa871a61702fa3781e160101,1420728621152.9ccc317ca180cabde13864d4600c8693.
> > > into efd8bec4dbf54ccca5f1351bfe9890c3(size=5.9 G), total size for store is
> > > 5.9 G. This selection was in queue for 0sec, and took 1mins, 57sec to execute.
> > >
> > > I've adjusted the following, thinking it might be a scanner cache size issue
> > > (we're dealing with docs of circa 100kb):
> > >
> > > hbase.rpc.timeout - 900000
> > > hbase.regionserver.lease.period - 450000
> > > hbase.client.scanner.timeout.period - 450000
> > > hbase.client.scanner.caching - (down to) 50
> > >
> > > To no avail. So I stripped the HBase config in hbase-site.xml down to the
> > > bare minimum, but I can still reproduce it with striking accuracy. The
> > > minimalistic job reads from a table (c. 3500 regions, 17 nodes), uses
> > > NullOutputFormat so nothing is written, and the mapper's map function does
> > > nothing.
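For concreteness, a rough sketch of what such a stripped-down scan job might
look like against the 0.98 mapreduce API, with the timeouts and caching listed
above set programmatically purely for illustration; the table name and class
names are placeholders, not the actual job:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class NoopScanJob {

  // Mapper that receives rows and deliberately does nothing with them.
  static class NoopMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // intentionally empty
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Values the original poster lists; normally these live in hbase-site.xml
    // on both clients and servers (the lease period is read server-side).
    conf.setLong("hbase.rpc.timeout", 900000L);
    conf.setLong("hbase.regionserver.lease.period", 450000L);
    conf.setLong("hbase.client.scanner.timeout.period", 450000L);

    Job job = Job.getInstance(conf, "noop-scan");
    job.setJarByClass(NoopScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(50);         // rows per scanner RPC (hbase.client.scanner.caching)
    scan.setCacheBlocks(false);  // usual recommendation for full-table MR scans

    // "mytable" is a placeholder for the real table name.
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, NoopMapper.class,
        NullWritable.class, NullWritable.class, job);

    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

initTableMapperJob() serializes the Scan into the job configuration, so the
caching value set on the Scan is what each map task's scanner actually runs
with.
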
> > > It starts pretty fast, getting through 1.75% of the specified scan in ~1
> > > minute, then hits 2.5% in ~2m and 3% in ~3m. Then, around 4m20s, a massive
> > > wave of the aforementioned OutOfOrderScannerNextException starts pouring
> > > in, slowing the job down until it fails ~1h later.
> > >
> > > I checked memory and disk usage on the individual nodes - all good - and
> > > the open file limits are set relatively high, so we're clearly not hitting
> > > that limit.
> > >
> > > I'm running out of sanity and was wondering if anyone might have any ideas?
> > >
> > > --
> > > Mateusz Kaczynski