Subject: Re: Mapred job failing with LeaseException
From: Ooh Rong
To: user@hbase.apache.org
Date: Thu, 12 Jul 2012 12:52:59 +0900

That's exactly why I am so confused! I can't think of anything in my
code that would take more than 60 seconds and block consecutive next()
calls. My "client" prints some progress info to standard out and my
"map task" (probably) just writes SequenceFiles to HDFS.
(I didn't actually write the map task. I just modified the Export class
from the hbase.mapreduce package by adding some filters to the Scan
object that is passed to TableMapReduceUtil.initTableMapperJob().)
But I'll definitely look into this, just to make sure.

Yesterday I ran some more tests. I got rid of the Filters that I had
added to the Export class and moved this "filtering" functionality
inside map(). Logically this is exactly the same code as the one with
the Filters, except that the filtering takes place in a different
process, i.e. region server vs. map task.

Here's some code:

public static Job createSubmittableJob(Configuration conf, String[] args)
    throws IOException {
  ...
  List<Filter> filters = new ArrayList<Filter>();
  filters.add(new SingleColumnValueFilter(CF_1, FLAG_1,
      CompareOp.EQUAL, LONG_ZERO));
  filters.add(new SingleColumnValueFilter(CF_1, FLAG_2,
      CompareOp.EQUAL, LONG_ZERO));
  filters.add(new SingleColumnValueFilter(CF_1, FLAG_3,
      CompareOp.EQUAL, LONG_ZERO));
  s.setFilter(new FilterList(Operator.MUST_PASS_ALL, filters));
  ...
public void map(ImmutableBytesWritable row, Result value, Context context)
    throws IOException {
  if (Bytes.equals(value.getValue(CF_1, FLAG_1), LONG_ZERO)
      && Bytes.equals(value.getValue(CF_1, FLAG_3), LONG_ZERO)
      && Bytes.equals(value.getValue(CF_1, B_FLAG), LONG_ZERO)) {
    ...
}

Now the fun part. This time the job finished successfully! There were
some failed tasks (only 0.09%, with speculative execution turned off)
but there were no LeaseExceptions. About 23% of these failed tasks
showed org.apache.hadoop.hbase.client.ScannerTimeoutException, and
about 17% failed to report status and got killed. The rest, which
makes up 60% of the failed tasks, were connection problems (e.g.
connection reset by peer, broken pipe) to the name node, which I think
is understandable.

Any kind of comments are welcome.
Thanks,

On Thu, Jul 12, 2012 at 8:22 AM, Suraj Varma wrote:
> The reason you get LeaseExceptions is that the time between two
> scanner.next() calls exceeded your hbase.regionserver.lease.period
> setting, which defaults to 60s. Whether it is your "client" or your
> "map task", if it opens a Scan against HBase, scanner.next() should
> continue to get invoked within this lease period - else, the client is
> considered dead and the lease is expired. When this "dead" client
> comes back and tries to do a scanner.next(), it gets a LeaseException.
>
> There are several threads on this ... so - google for "hbase scanner
> leaseexception" and such. See:
> http://mail-archives.apache.org/mod_mbox/hbase-user/200903.mbox/%3Cfa03480d0903110823l5678e8dem353f345483799c5@mail.gmail.com%3E
> http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/10225
>
> Are you doing some processing in between two scanner.next() calls that
> at times takes over 60s?
> --Suraj
>
>
> On Wed, Jul 11, 2012 at 1:23 AM, 최우용 wrote:
>> Hi,
>>
>> I'm running a cluster of a few hundred servers with Cloudera's CDH3u4
>> HBase+Hadoop,
>> and having trouble with what I think is a simple map job that uses
>> an HBase table as its input.
>> My mapper code is org.apache.hadoop.hbase.mapreduce.Export with a few
>> SingleColumnValueFilters (i.e. a FilterList) added to the Scan object.
>> The job seems to progress without any trouble at first, but after
>> about 5-7 minutes, when a little over 50% of the map tasks have
>> completed, I suddenly see a lot of LeaseExceptions and the job
>> ultimately fails.
>>
>> Here's the stack trace I see on my failed tasks:
>>
>> org.apache.hadoop.hbase.regionserver.LeaseException:
>> org.apache.hadoop.hbase.regionserver.LeaseException: lease
>> '7595201038414594449' does not exist
>>   at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)
>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1881)
>>   at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>
>> I had a similar problem when I was scanning a particular region using
>> a ResultScanner in a single-threaded manner with the same filters
>> mentioned above, but I assumed it wouldn't be a problem in mapred
>> since it's more resilient to single-task errors.
>>
>> I tried row caching with Scan.setCaching() and lowered the
>> mapred.tasktracker.map.tasks.maximum property in hopes of reducing
>> the total load on the region servers, but nothing worked.
>>
>> Could this be a filter performance problem preventing region servers
>> from responding before lease expiration?
>> Or maybe a long sequence of rows doesn't match my filter list and the
>> lease expires before the scan finally hits one that does.
>>
>> I'm kind of new to Hadoop map-reduce and HBase, so any pointers would
>> be very much appreciated.
>> Thanks.
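The lease period Suraj mentions is a region-server-side setting. A
minimal sketch of the corresponding hbase-site.xml fragment, with an
illustrative value of 120000 ms (double the 60s default); this must be
set on the region servers themselves, and they must be restarted for
it to take effect:

```xml
<!-- hbase-site.xml on each region server; 120000 ms is an
     illustrative value, not a recommendation -->
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>120000</value>
</property>
```

Raising the lease period is a blunt instrument, though; lowering
Scan.setCaching() (as tried above) attacks the same problem from the
client side by making each next() call return sooner.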