Subject: Re: export snapshot fail sometime due to LeaseExpiredException
From: Tianying Chang <tychang@gmail.com>
To: user@hbase.apache.org
Date: Wed, 30 Apr 2014 21:18:35 -0700

Hi,

I found that only the cluster I was using for my test has this issue. When I changed the destination cluster to another one, the problem went away. I still don't know what is special about that cluster that causes the job to fail sometimes, especially on the 2nd, 3rd... run. But at least I know ExportSnapshot is stable. Thanks a lot for your help.

Tian-Ying

On Wed, Apr 30, 2014 at 3:43 PM, Ted Yu wrote:

> bq. 1. delete_snapshot 'myTable'
>
> myTable is a table, not the name of a snapshot, right?
>
> HBASE-10766 was not among the list of patches in your earlier email. Can you apply the patch and try again?
>
> Cheers
>
> On Wed, Apr 30, 2014 at 3:31 PM, Tianying Chang wrote:
>
> Actually, my test on a 90G table always succeeds and never fails. The failing one is a production table of about 400G with 460 regions.
> The weird thing is that the first run after I refresh the jar (either the throttled or the non-throttled one) always succeeds with no failed tasks, but the 2nd, 3rd... runs always fail. The error message says the destination file does not exist, yet that is the very file the task is trying to copy into, which is strange.
>
> BTW, I clean up the destination cluster by doing three things:
> 1. delete_snapshot 'myTable'
> 2. hadoop dfs -rmr /hbase/.hbase-snapshot/.tmp
> 3. hadoop dfs -rmr /hbase/.archive/myTable
>
> Thanks
> Tian-Ying
>
> On Wed, Apr 30, 2014 at 3:07 PM, Matteo Bertozzi <theo.bertozzi@gmail.com> wrote:
>
> Can you post your ExportSnapshot.java code?
> Is your destination an hbase cluster? If yes, do you have HBASE-10766? If not, try to export to an hdfs path (not a /hbase subdir).
> Do you have other stuff playing with the files in .archive, or multiple ExportSnapshot jobs running against the same set of files?
>
> We have tested ExportSnapshot with 40G files, so the problem is not the size. It may be one of the above, or your lease timeout may be too low for the "busy" state of your machines.
>
> Matteo
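(For reference, an export to a plain hdfs path of the kind Matteo suggests above would be run roughly as follows. This is a minimal sketch: the snapshot name, destination NameNode and mapper count are placeholders rather than values from this thread, and the -bandwidth option is only present with the HBASE-11083 backport discussed further down.)

# Minimal sketch: export the snapshot to a staging path on the destination
# cluster instead of the live /hbase tree. Snapshot name, NameNode host and
# mapper count below are placeholders, not values from this thread.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot rich_pin_data_v1_snapshot \
  -copy-to hdfs://dest-nn:8020/snapshot-staging \
  -mappers 16 \
  -bandwidth 200   # the 200M cap discussed in this thread; requires the HBASE-11083 backport

The point of the non-/hbase destination, as Matteo notes, is to keep the copy out of directories that the destination HBase cluster manages itself.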
> On Wed, Apr 30, 2014 at 2:55 PM, Tianying Chang wrote:
>
> I think it is not directly caused by the throttle. On the 2nd run with the non-throttled jar, the LeaseExpiredException shows up again (for a big file). So it does seem like ExportSnapshot is not reliable for big files.
>
> The weird thing is that when I replace the jar and restart the cluster, the first run on the big table always succeeds, but the later runs always fail with these LeaseExpiredExceptions. A smaller table has no problem no matter how many times I re-run.
>
> Thanks
> Tian-Ying
>
> On Wed, Apr 30, 2014 at 2:24 PM, Tianying Chang wrote:
>
> Ted,
>
> It seems it is due to HBASE-11083 (throttle bandwidth during snapshot export). After I reverted it, the job succeeded again. Even when I set the throttle bandwidth high, like 200M, iftop shows a much lower value. Maybe the throttle is sleeping longer than it is supposed to? But I am not clear why a slow copy job can cause a LeaseExpiredException. Any idea?
>
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on /hbase/.archive/rich_pin_data_v1/b50ab10bb4812acc2e9fa6c564c9adef/d/bac3c661a897466aaf1706a9e1bd9e9a
> File does not exist. Holder DFSClient_NONMAPREDUCE_-2096088484_1 does not have any open files.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:2454)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2431)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:536)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:335)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$
>
> Thanks
> Tian-Ying
>
> On Wed, Apr 30, 2014 at 1:25 PM, Ted Yu wrote:
>
> Tianying:
> Have you checked the audit log on the namenode for a deletion event corresponding to the files involved in the LeaseExpiredException?
>
> Cheers
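(A concrete way to act on Ted's suggestion above is to search the NameNode audit log for a delete of the exact path named in the exception around the time of the failure. A rough sketch; the audit log location shown is a placeholder and differs per installation.)

# Minimal sketch: look for a deletion of the file named in the exception.
# The log path is a placeholder; adjust it to wherever the NameNode audit
# log lives on your installation.
grep 'cmd=delete' /var/log/hadoop-hdfs/hdfs-audit.log \
  | grep 'rich_pin_data_v1/b50ab10bb4812acc2e9fa6c564c9adef/d/bac3c661a897466aaf1706a9e1bd9e9a'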
> On Wed, Apr 30, 2014 at 10:44 AM, Tianying Chang <tychang@gmail.com> wrote:
>
> This time the re-run passed (although with many failed/retried tasks) with the throttle bandwidth set to 200M (although by iftop it never gets close to that number). Is there a way to increase the lease expiry time for an individual export job running with a low throttle bandwidth?
>
> Thanks
> Tian-Ying
>
> On Wed, Apr 30, 2014 at 10:17 AM, Tianying Chang wrote:
>
> Yes, I am using the bandwidth throttle feature. The export job for this table actually succeeded on its first run. When I rerun it (for my robustness testing) it seems to never pass. I am wondering if it has some weird state (I did clean up the target cluster and even removed the /hbase/.archive/rich_pin_data_v1 folder).
>
> It seems that even if I set the throttle value really large, it still fails. And I think even after I replace the jar with the one without the throttle, it still fails on re-runs.
>
> Is there some way that I can increase the lease to be very large to test it out?
>
> On Wed, Apr 30, 2014 at 10:02 AM, Matteo Bertozzi <theo.bertozzi@gmail.com> wrote:
>
> The file is the file in the export, so you are creating that file. Do you have the bandwidth throttle on?
>
> I'm thinking that the file is being written slowly, e.g. write(few bytes), wait, write(few bytes), and during one of the waits your lease expires. Something like that can also happen if your MR job is stuck in some way (a slow machine or similar) and is not writing within the lease timeout.
>
> Matteo
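(Related to Matteo's earlier question about other things touching .archive, and to the "File does not exist" wording of the exceptions: one simple check is to watch the destination archive directory while the export runs and see whether the partially written HFile disappears underneath the copy. A rough sketch, reusing the region and file names from the stack traces in this thread.)

# Minimal sketch: on the destination cluster, poll the archive directory for
# the table while the export is running, to see whether in-flight HFiles are
# being removed (for example by a cleaner chore on the destination).
while true; do
  date
  hadoop dfs -ls /hbase/.archive/rich_pin_data_v1/b50ab10bb4812acc2e9fa6c564c9adef/d/
  sleep 30
done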
> On Wed, Apr 30, 2014 at 9:53 AM, Tianying Chang wrote:
>
> We are using Hadoop 2.0.0-cdh4.2.0 and hbase 0.94.7. We also backported several snapshot-related jiras, e.g. HBASE-10111 (verify snapshot) and HBASE-11083 (bandwidth throttle in ExportSnapshot).
>
> I found that when the LeaseExpiredException was first reported, the file was indeed not there, and the map task retried. I verified a couple of minutes later that the HFile does exist under /.archive. But the retried map task still complains with the same file-does-not-exist error...
>
> I will check the namenode log for the LeaseExpiredException.
>
> Thanks
> Tian-Ying
>
> On Wed, Apr 30, 2014 at 9:33 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> Can you give us the hbase and hadoop releases you're using?
>
> Can you check the namenode log around the time the LeaseExpiredException was encountered?
>
> Cheers
>
> On Wed, Apr 30, 2014 at 9:20 AM, Tianying Chang <tychang@gmail.com> wrote:
>
> Hi,
>
> When I export a large table with 460+ regions, I see the exportSnapshot job fail sometimes (not all the time). The error of the map task is below, but I verified that the file highlighted below does exist. A smaller table seems to always pass. Any idea? Is it because it is too big and gets a session timeout?
>
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on /hbase/.archive/rich_pin_data_v1/7713d5331180cb610834ba1c4ebbb9b3/d/eef3642f49244547bb6606d4d0f15f1f
> File does not exist. Holder DFSClient_NONMAPREDUCE_279781617_1 does not have any open files.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
>     at org.apache.hadoop.ipc.ProtobufR
>
> Thanks
>
> Tian-Ying