From: Dejan Menges
Date: Thu, 25 Jun 2015 18:13:44 +0000
Subject: Re: OutOfOrderScannerNextException consistently killing mapreduce jobs
To: user@hbase.apache.org

We recently had the same issue, but in our case it was hotspotting on regions
that had grown exceptionally large. After splitting some of those regions and
fixing the hotspotting, the problem disappeared.
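For reference, a minimal sketch of forcing such a split from client code with
the 0.98 HBaseAdmin API (the HBase shell's "split" command does the same thing).
The table name and split key below are hypothetical placeholders, not values
taken from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SplitOversizedRegion {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // "mytable" and "some-mid-key" are placeholders: this asks the region of
      // "mytable" that contains "some-mid-key" to split at that key. Picking a
      // key near the middle of the hot region's range keeps the two daughters
      // roughly balanced by key range.
      admin.split("mytable", "some-mid-key");
    } finally {
      admin.close();
    }
  }
}

The split request is asynchronous, so the call returns before the daughter
regions are online.
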
On Thu, Jun 25, 2015 at 7:25 PM Vladimir Rodionov wrote:

> Mateusz,
>
> - How many regions do you have in your table?
> - What is the cluster size?
> - What is the scan spec in your M/R job (time range, filters)?
> - RS node spec (CPUs, RAM, disks)?
>
> Ted's link is a good starting point.
>
> -Vlad
>
> On Thu, Jun 25, 2015 at 10:19 AM, Ted Yu wrote:
>
> > Have you read this thread http://search-hadoop.com/m/YGbb1sOLh2W9Z9z ?
> >
> > Cheers
> >
> > On Thu, Jun 25, 2015 at 10:10 AM, Mateusz Kaczynski <mateusz@arachnys.com> wrote:
> >
> > > One of our clusters running HBase 0.98.6-cdh5.3.0 used to run (relatively)
> > > smoothly until a couple of days ago, when all of a sudden jobs started
> > > grinding to a halt and getting killed after reporting a massive number of
> > > errors of the form:
> > >
> > > org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of
> > > OutOfOrderScannerNextException: was there a rpc timeout?
> > >   at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:410)
> > >   at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:230)
> > >   at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
> > >   at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
> > >   at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
> > >   at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
> > >   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
> > >   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
> > >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
> > >   at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> > >   at java.security.AccessController.doPrivileged(Native Method)
> > >   at javax.security.auth.Subject.doAs(Subject.java:415)
> > >   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
> > >   at org.apache.hadoop.mapred.Child.main(Child.java:262)
> > >
> > > The HBase regionserver logs contain a bunch of:
> > >
> > > WARN [B.defaultRpcServer.handler=16,queue=1,port=60020] ipc.RpcServer:
> > > B.defaultRpcServer.handler=16,queue=1,port=60020: caught a
> > > ClosedChannelException, this means that the server was processing a request
> > > but the client went away. The error message was: null
> > >
> > > and:
> > >
> > > INFO [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner
> > > 1086 lease expired on region
> > > table,8bf8fc3cd0e842c00fb4e556bbbdcd0f,1420155383100.19f5ed7c735d33b2cf8997e0b373a1a7
> > >
> > > In addition, there are reports of compactions (not sure if relevant at all):
> > >
> > > regionserver.HStore: Completed major compaction of 3 file(s) in cf of
> > > table,fc0caf49fa871a61702fa3781e160101,1420728621152.9ccc317ca180cabde13864d4600c8693.
> > > into efd8bec4dbf54ccca5f1351bfe9890c3(size=5.9 G), total size for store is
> > > 5.9 G. This selection was in queue for 0sec, and took 1mins, 57sec to execute.
> > >
> > > I've adjusted the following, thinking it might be a scanner cache size issue
> > > (we're dealing with docs of circa 100kb):
> > >
> > > hbase.rpc.timeout - 900000
> > > hbase.regionserver.lease.period - 450000
> > > hbase.client.scanner.timeout.period - 450000
> > > hbase.client.scanner.caching - (down to) 50
> > >
> > > To no avail. So I stripped the HBase config in hbase-site.xml down to the
> > > bare minimum, but I can still reproduce it with striking accuracy. The
> > > minimalistic job reads from a table (c. 3500 regions, 17 nodes), uses
> > > NullOutputFormat so nothing is written, and the mapper's map function does
> > > nothing.
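For concreteness, a rough sketch of what such a stripped-down scan job might
look like against the 0.98 mapreduce API, with the timeouts and caching listed
above set programmatically purely for illustration; the table name and class
names are placeholders, not the actual job:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class NoopScanJob {

  // Mapper that receives rows and deliberately does nothing with them.
  static class NoopMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // intentionally empty
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Values the original poster lists; normally these live in hbase-site.xml
    // on both clients and servers (the lease period is read server-side).
    conf.setLong("hbase.rpc.timeout", 900000L);
    conf.setLong("hbase.regionserver.lease.period", 450000L);
    conf.setLong("hbase.client.scanner.timeout.period", 450000L);

    Job job = Job.getInstance(conf, "noop-scan");
    job.setJarByClass(NoopScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(50);         // rows per scanner RPC (hbase.client.scanner.caching)
    scan.setCacheBlocks(false);  // usual recommendation for full-table MR scans

    // "mytable" is a placeholder for the real table name.
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, NoopMapper.class,
        NullWritable.class, NullWritable.class, job);

    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

initTableMapperJob() serializes the Scan into the job configuration, so the
caching value set on the Scan is what each map task's scanner actually runs
with.
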
> > > It starts pretty fast, getting through 1.75% of the specified scan in ~1
> > > minute, then hits 2.5% in ~2m and 3% in ~3m. Then, around 4m20s, a massive
> > > wave of the aforementioned OutOfOrderScannerNextException starts pouring
> > > in, slowing the job down until it fails ~1h later.
> > >
> > > I checked memory and disk usage on the individual nodes - all good - and
> > > the open file limits are set relatively high, so we're clearly not hitting
> > > that limit.
> > >
> > > I'm running out of sanity and was wondering if anyone might have any ideas?
> > >
> > > --
> > > Mateusz Kaczynski