Subject: Re: Mapred job failing with LeaseException
From: Ooh Rong
To: user@hbase.apache.org
Date: Thu, 12 Jul 2012 12:52:59 +0900

That's exactly why I am so confused! I can't think of anything in my
code that would take more than 60 seconds and block consecutive next()
calls. My "client" prints some progress info to standard out and my
"map task" (probably) just writes SequenceFiles to HDFS.
(I didn't actually write the map task. I just modified the Export class
from the hbase.mapreduce package by adding some filters to the Scan
object that is passed to TableMapReduceUtil.initTableMapperJob().)
But I'll definitely look into this, just to make sure.

Yesterday I ran some more tests. I got rid of the Filters that I had
added to the Export class and moved this "filtering" functionality
inside map(). Logically this is exactly the same code as the one with
the Filters, except that the filtering takes place in a different
process, i.e. region server vs. map task.

Here's some code:

public static Job createSubmittableJob(Configuration conf, String[] args)
    throws IOException {
  ...
  List<Filter> filters = new ArrayList<Filter>();
  filters.add(new SingleColumnValueFilter(CF_1, FLAG_1,
      CompareOp.EQUAL, LONG_ZERO));
  filters.add(new SingleColumnValueFilter(CF_1, FLAG_2,
      CompareOp.EQUAL, LONG_ZERO));
  filters.add(new SingleColumnValueFilter(CF_1, FLAG_3,
      CompareOp.EQUAL, LONG_ZERO));
  s.setFilter(new FilterList(Operator.MUST_PASS_ALL, filters));
  ...
public void map(ImmutableBytesWritable row, Result value, Context context)
    throws IOException {
  if (Bytes.equals(value.getValue(CF_1, FLAG_1), LONG_ZERO)
      && Bytes.equals(value.getValue(CF_1, FLAG_3), LONG_ZERO)
      && Bytes.equals(value.getValue(CF_1, B_FLAG), LONG_ZERO)) {
    ...
}

Now the fun part. This time the job finished successfully! There were
some failed tasks (only 0.09%, with speculative execution turned off)
but there were no LeaseExceptions. About 23% of these failed tasks
showed org.apache.hadoop.hbase.client.ScannerTimeoutException, and
about 17% failed to report status and got killed. The rest, which
makes up 60% of the failed tasks, were connection problems (e.g.
connection reset by peer, broken pipe) to the name node, which I think
is understandable.

Any kind of comments are welcome.
Thanks,

On Thu, Jul 12, 2012 at 8:22 AM, Suraj Varma wrote:
> The reason you get LeaseExceptions is that the time between two
> scanner.next() calls exceeded your hbase.regionserver.lease.period
> setting, which defaults to 60s. Whether it is your "client" or your
> "map task", if it opens a Scan against HBase, scanner.next() should
> continue to get invoked within this lease period - else, the client is
> considered dead and the lease is expired. When this "dead" client
> comes back and tries to do a scanner.next(), it gets a LeaseException.
>
> There are several threads on this ... so - google for "hbase scanner
> leaseexception" and such. See:
> http://mail-archives.apache.org/mod_mbox/hbase-user/200903.mbox/%3Cfa03480d0903110823l5678e8dem353f345483799c5@mail.gmail.com%3E
> http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/10225
>
> Are you doing some processing in between two scanner.next() calls that
> at times takes over 60s?
> --Suraj
>
>
> On Wed, Jul 11, 2012 at 1:23 AM, 최우용 wrote:
>> Hi,
>>
>> I'm running a cluster of a few hundred servers with Cloudera's CDH3u4
>> HBase+Hadoop,
>> and having trouble with what I think is a simple map job that uses
>> an HBase table as its input.
>> My mapper code is org.apache.hadoop.hbase.mapreduce.Export with a few
>> SingleColumnValueFilters (i.e. a FilterList) added to the Scan object.
>> The job seems to progress without any trouble at first, but after
>> about 5-7 minutes, when a little over 50% of the map tasks have
>> completed, I suddenly see a lot of LeaseExceptions and the job
>> ultimately fails.
>>
>> Here's the stack trace I see on my failed tasks:
>>
>> org.apache.hadoop.hbase.regionserver.LeaseException:
>> org.apache.hadoop.hbase.regionserver.LeaseException: lease
>> '7595201038414594449' does not exist
>>   at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)
>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1881)
>>   at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>
>> I had a similar problem when I was scanning a particular region using
>> a ResultScanner in a single-threaded manner with the same filters
>> mentioned above, but I assumed it wouldn't be a problem in mapred
>> since it's more resilient to single-task errors.
>>
>> I tried row caching with Scan.setCaching() and lowered the
>> mapred.tasktracker.map.tasks.maximum property in hopes of reducing
>> the total load on the region servers, but nothing worked.
>>
>> Could this be a filter performance problem preventing region servers
>> from responding before lease expiration?
>> Or maybe a long sequence of rows doesn't match my filter list and the
>> lease expires before the scan finally hits one that does.
>>
>> I'm kind of new to Hadoop map-reduce and HBase, so any pointers would
>> be very much appreciated.
>> Thanks.
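The lease period Suraj mentions is a region-server-side setting. A
minimal sketch of the corresponding hbase-site.xml fragment, with an
illustrative value of 120000 ms (double the 60s default); this must be
set on the region servers themselves, and they must be restarted for
it to take effect:

```xml
<!-- hbase-site.xml on each region server; 120000 ms is an
     illustrative value, not a recommendation -->
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>120000</value>
</property>
```

Raising the lease period is a blunt instrument, though; lowering
Scan.setCaching() (as tried above) attacks the same problem from the
client side by making each next() call return sooner.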