hbase-user mailing list archives

From Tianying Chang <tych...@gmail.com>
Subject Re: export snapshot fail sometime due to LeaseExpiredException
Date Wed, 30 Apr 2014 22:31:51 GMT
Actually, my tests on a 90G table always succeed and never fail. The failing
one is a production table of about 400G with 460 regions.

The weird thing is that the first run after I refresh the jar (either the
throttled or the non-throttled one) always succeeds with no failed tasks, but
the 2nd, 3rd, ... runs always fail. The error message says the destination
file does not exist, which is very strange, since that is the very file it is
trying to copy into.

BTW, I clean up the destination cluster by doing 3 things:
1. delete_snapshot 'myTable'
2. hadoop dfs -rmr /hbase/.hbase-snapshot/.tmp
3. hadoop dfs -rmr /hbase/.archive/myTable

Thanks
Tian-Ying
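The three cleanup steps above can be scripted. A minimal dry-run sketch, assuming the snapshot and the table share a name as in the example; `SnapshotCleanupPlan` is a hypothetical helper, and piping a command into `hbase shell` is one common way to run it non-interactively. It only builds and prints the commands so they can be reviewed before anything is deleted:

```java
import java.util.List;

// Hypothetical helper: builds (but does not run) the destination-cluster
// cleanup commands described above, so they can be reviewed first.
class SnapshotCleanupPlan {
    static List<String> commands(String snapshotName, String tableName) {
        return List.of(
            // 1. Delete the snapshot through the HBase shell.
            "echo \"delete_snapshot '" + snapshotName + "'\" | hbase shell",
            // 2. Remove the snapshot working directory.
            "hadoop dfs -rmr /hbase/.hbase-snapshot/.tmp",
            // 3. Remove the table's archived HFiles left by earlier exports.
            "hadoop dfs -rmr /hbase/.archive/" + tableName);
    }

    public static void main(String[] args) {
        // Print one command per line, e.g. for pasting into a terminal.
        commands("myTable", "myTable").forEach(System.out::println);
    }
}
```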


On Wed, Apr 30, 2014 at 3:07 PM, Matteo Bertozzi <theo.bertozzi@gmail.com> wrote:

> Can you post your ExportSnapshot.java code?
> Is your destination an HBase cluster? If yes, do you have HBASE-10766? If
> not, try exporting to an HDFS path (not a /hbase subdirectory).
> Do you have anything else touching the files in .archive, or multiple
> ExportSnapshot jobs running against the same set of files?
>
> We have tests for ExportSnapshot with 40G files, so the problem is not the
> size.
> It may be one of the above, or your lease timeout may be too low for how
> busy your machines are.
>
> Matteo
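To make the question about "other stuff" touching the files concrete, here is a toy model (not the real NameNode code; every name is hypothetical) of how a writer that still has its output stream open gets a "No lease on ... File does not exist" error once some other actor deletes or moves the path it is writing. The message text only mimics the one quoted in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: tracks which client holds the write lease on each path
// that is still being written. Not how HDFS is implemented internally.
class ToyNamespace {
    private final Map<String, String> openFiles = new HashMap<>();

    // A client creates a file and becomes its lease holder.
    void create(String path, String holder) {
        openFiles.put(path, holder);
    }

    // Some other actor (cleanup script, another job, ...) removes the path.
    void delete(String path) {
        openFiles.remove(path);
    }

    // The writer asks for another block; the lease is checked first.
    void addBlock(String path, String holder) {
        String owner = openFiles.get(path);
        if (owner == null) {
            // The path is gone, so nobody holds a lease on it any more.
            throw new IllegalStateException("No lease on " + path
                + " File does not exist. Holder " + holder
                + " does not have any open files.");
        }
        if (!owner.equals(holder)) {
            // Another client owns the lease on this path.
            throw new IllegalStateException("Lease mismatch on " + path
                + " owned by " + owner + " but is accessed by " + holder);
        }
    }

    public static void main(String[] args) {
        ToyNamespace ns = new ToyNamespace();
        String hfile = "/hbase/.archive/myTable/region/d/hfile";
        ns.create(hfile, "DFSClient_A");
        ns.addBlock(hfile, "DFSClient_A"); // fine: A still holds the lease
        ns.delete(hfile);                  // e.g. a concurrent cleanup
        try {
            ns.addBlock(hfile, "DFSClient_A");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```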
>
>
>
> On Wed, Apr 30, 2014 at 2:55 PM, Tianying Chang <tychang@gmail.com> wrote:
>
> > I think it is not directly caused by the throttle. On the 2nd run with the
> > non-throttled jar, the LeaseExpiredException showed up again (for a big
> > file). So it does seem that ExportSnapshot is not reliable for big files.
> >
> > The weird thing is that when I replace the jar and restart the cluster, the
> > first run on the big table always succeeds, but later runs always fail with
> > these LeaseExpiredExceptions. Smaller tables have no problem no matter how
> > many times I re-run.
> >
> > Thanks
> > Tian-Ying
> >
> >
> > On Wed, Apr 30, 2014 at 2:24 PM, Tianying Chang <tychang@gmail.com> wrote:
> >
> > > Ted,
> > >
> > > It seems it is due to HBASE-11083 (throttle bandwidth during snapshot
> > > export) <https://issues.apache.org/jira/browse/HBASE-11083>. After I
> > > reverted it, the job succeeded again. Even when I set the throttle
> > > bandwidth high, like 200M, iftop shows a much lower value. Maybe the
> > > throttle sleeps longer than it is supposed to? But I don't understand why
> > > a slow copy job would cause a LeaseExpiredException. Any idea?
> > >
> > >
> > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /hbase/.archive/rich_pin_data_v1/b50ab10bb4812acc2e9fa6c564c9adef/d/bac3c661a897466aaf1706a9e1bd9e9a File does not exist. Holder DFSClient_NONMAPREDUCE_-2096088484_1 does not have any open files.
> > >       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
> > >       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
> > >       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:2454)
> > >       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2431)
> > >       at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:536)
> > >       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:335)
> > >       at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$
> > >
> > >
> > > Thanks
> > > Tian-Ying
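On the question of whether the throttle sleeps too long: one common way to throttle a copy is to cap the average rate since the start of the transfer, pausing whenever the bytes sent so far would exceed `limit × elapsed`. The sketch below is a generic version of that idea simulated with a fake clock; the class and method names are hypothetical and this is not the HBASE-11083 code. It illustrates that such a limiter can only slow a copy down to the cap: if the underlying read/write path is already slower than the cap, it never pauses and the observed rate stays below the configured value.

```java
// Generic average-rate throttle, simulated with a fake millisecond clock.
// Hypothetical sketch; not the HBASE-11083 implementation.
class ThrottleSketch {
    /** Pause (ms) needed so totalBytes over elapsedMs stays within maxBytesPerSec. */
    static long requiredPauseMs(long totalBytes, long elapsedMs, long maxBytesPerSec) {
        // Smallest elapsed time at which totalBytes is allowed, rounded up.
        long minElapsedMs = (totalBytes * 1000L + maxBytesPerSec - 1) / maxBytesPerSec;
        return Math.max(0L, minElapsedMs - elapsedMs);
    }

    /**
     * Copies totalBytes in chunkBytes pieces, where moving each chunk itself
     * takes msPerChunk, and returns the total elapsed time in ms.
     */
    static long simulateCopyMs(long totalBytes, long chunkBytes,
                               long msPerChunk, long maxBytesPerSec) {
        long sent = 0;
        long elapsedMs = 0;
        while (sent < totalBytes) {
            long n = Math.min(chunkBytes, totalBytes - sent);
            elapsedMs += msPerChunk; // time spent actually moving the bytes
            sent += n;
            elapsedMs += requiredPauseMs(sent, elapsedMs, maxBytesPerSec); // throttle sleep
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Fast I/O: 400 MB at a 200 MB/s cap is paced to 2000 ms.
        System.out.println(simulateCopyMs(400 * mb, 64 * 1024, 0, 200 * mb));
        // Slow I/O (1 ms per 64 KB chunk, i.e. 62.5 MB/s): the cap is never reached.
        System.out.println(simulateCopyMs(400 * mb, 64 * 1024, 1, 200 * mb));
    }
}
```

This prints 2000 and then 6400: in the second case 400 MB takes 6.4 s (62.5 MB/s in the same 1024-based units), well under the 200 MB/s setting, without a single throttle pause.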
> > >
> > >
> > > On Wed, Apr 30, 2014 at 1:25 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > >
> > >> Tianying:
> > >> Have you checked the NameNode audit log for deletion events
> > >> corresponding to the files involved in the LeaseExpiredException?
> > >>
> > >> Cheers
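Following the audit-log suggestion, a small sketch for scanning a NameNode audit log for operations that removed a file or one of its parent directories (for example, a `hadoop dfs -rmr /hbase/.archive/myTable` cleanup like the one described above). It assumes the default audit line layout of tab-separated `key=value` fields containing `cmd=...` followed by `src=...`; the class name, and treating `delete` and `rename` as the operations of interest, are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds audit-log entries that deleted or renamed a
// file or any directory above it. Assumes "cmd=<op><TAB>src=<path>" fields.
class AuditLogScan {
    private static final Pattern CMD_SRC = Pattern.compile("cmd=(\\S+)\\s+src=(\\S+)");

    static List<String> removalsAffecting(String file, List<String> auditLines) {
        List<String> hits = new ArrayList<>();
        for (String line : auditLines) {
            Matcher m = CMD_SRC.matcher(line);
            if (!m.find()) {
                continue; // not an audit entry in the assumed format
            }
            String cmd = m.group(1);
            String src = m.group(2);
            boolean removes = cmd.equals("delete") || cmd.equals("rename");
            // Match the file itself or any ancestor directory of it.
            boolean covers = file.equals(src) || file.startsWith(src + "/");
            if (removes && covers) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java AuditLogScan.java <hdfs-file-path> <audit-log-file>
        List<String> lines = Files.readAllLines(Paths.get(args[1]));
        removalsAffecting(args[0], lines).forEach(System.out::println);
    }
}
```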
> > >>
> > >>
> > >> On Wed, Apr 30, 2014 at 10:44 AM, Tianying Chang <tychang@gmail.com> wrote:
> > >>
> > >> > This time the re-run passed (although with many failed/retried tasks)
> > >> > with my throttle bandwidth set to 200M (although iftop shows it never
> > >> > gets close to that number). Is there a way to increase the lease expiry
> > >> > time for an individual export job that uses a low throttle bandwidth?
> > >> >
> > >> > Thanks
> > >> > Tian-Ying
> > >> >
> > >> >
> > >> >
> > >> > On Wed, Apr 30, 2014 at 10:17 AM, Tianying Chang <tychang@gmail.com> wrote:
> > >> >
> > >> > > Yes, I am using the bandwidth throttle feature. The export job for
> > >> > > this table actually succeeded on its first run. When I rerun it (for
> > >> > > my robustness testing), it never seems to pass. I am wondering if
> > >> > > there is some weird leftover state (I did clean up the target
> > >> > > cluster, and even removed the /hbase/.archive/rich_pint_data_v1
> > >> > > folder).
> > >> > >
> > >> > > It seems that even if I set the throttle value really large, it
> > >> > > still fails. And I think even after I replace the jar with the one
> > >> > > without the throttle, re-runs still fail.
> > >> > >
> > >> > > Is there some way I can make the lease very long to test this out?
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Wed, Apr 30, 2014 at 10:02 AM, Matteo Bertozzi <theo.bertozzi@gmail.com> wrote:
> > >> > >
> > >> > >> The file is the one being exported, so you are the one creating it.
> > >> > >> Do you have the bandwidth throttle on?
> > >> > >>
> > >> > >> I'm thinking that the file is being written slowly, e.g. write (a
> > >> > >> few bytes), wait, write (a few bytes), and during the wait your
> > >> > >> lease expires. Something like that can also happen if your MR job
> > >> > >> is stuck in some way (slow machine or similar) and is not writing
> > >> > >> within the lease timeout.
> > >> > >>
> > >> > >> Matteo
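For reference alongside the lease-timeout theory: HDFS leases have a soft limit (after which another client may take over the file) and a hard limit (after which the NameNode recovers the lease itself), and the writing client normally renews its leases in the background while it has files open. A sketch of that classification, assuming the stock limits of 60 seconds and 1 hour; the class and enum names are hypothetical. Under these assumptions a lease lapses when renewals stop (e.g. a stuck task), not merely because a throttled write is slow.

```java
import java.time.Duration;

// Hypothetical model of how a lease is classified by time since last renewal.
// Assumes the stock HDFS limits: 60 s soft, 1 h hard.
class LeaseAge {
    enum State { ACTIVE, SOFT_EXPIRED, HARD_EXPIRED }

    static final Duration SOFT_LIMIT = Duration.ofSeconds(60);
    static final Duration HARD_LIMIT = Duration.ofHours(1);

    static State classify(Duration sinceLastRenewal) {
        if (sinceLastRenewal.compareTo(HARD_LIMIT) > 0) {
            return State.HARD_EXPIRED; // NameNode may recover the lease itself
        }
        if (sinceLastRenewal.compareTo(SOFT_LIMIT) > 0) {
            return State.SOFT_EXPIRED; // another client may take the file over
        }
        return State.ACTIVE; // holder is still renewing in time
    }

    public static void main(String[] args) {
        // A client renewing every 30 s never leaves ACTIVE, however slow its writes.
        System.out.println(classify(Duration.ofSeconds(30)));
        // A task stuck for 5 minutes without renewing is past the soft limit.
        System.out.println(classify(Duration.ofMinutes(5)));
    }
}
```

This prints ACTIVE and then SOFT_EXPIRED.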
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> On Wed, Apr 30, 2014 at 9:53 AM, Tianying Chang <tychang@gmail.com> wrote:
> > >> > >>
> > >> > >> > We are using Hadoop 2.0.0-cdh4.2.0 and HBase 0.94.7. We also
> > >> > >> > backported several snapshot-related JIRAs, e.g. 10111 (verify
> > >> > >> > snapshot) and 11083 (bandwidth throttle in ExportSnapshot).
> > >> > >> >
> > >> > >> > I found that when the LeaseExpiredException was first reported,
> > >> > >> > the file was indeed not there, and the map task retried. I
> > >> > >> > verified a couple of minutes later that the HFile does exist
> > >> > >> > under /.archive, but the retried map task still complains with
> > >> > >> > the same "file does not exist" error...
> > >> > >> >
> > >> > >> > I will check the namenode log for the LeaseExpiredException.
> > >> > >> >
> > >> > >> >
> > >> > >> > Thanks
> > >> > >> >
> > >> > >> > Tian-Ying
> > >> > >> >
> > >> > >> >
> > >> > >> > On Wed, Apr 30, 2014 at 9:33 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > >> > >> >
> > >> > >> > > Can you give us the HBase and Hadoop releases you're using?
> > >> > >> > >
> > >> > >> > > Can you check the NameNode log around the time the
> > >> > >> > > LeaseExpiredException was encountered?
> > >> > >> > >
> > >> > >> > > Cheers
> > >> > >> > >
> > >> > >> > >
> > >> > >> > > On Wed, Apr 30, 2014 at 9:20 AM, Tianying Chang <tychang@gmail.com> wrote:
> > >> > >> > >
> > >> > >> > > > Hi,
> > >> > >> > > >
> > >> > >> > > > When I export a large table with 460+ regions, I see the
> > >> > >> > > > ExportSnapshot job fail sometimes (not every time). The map
> > >> > >> > > > task error is below, but I verified that the highlighted file
> > >> > >> > > > does exist. Smaller tables always seem to pass. Any idea? Is
> > >> > >> > > > it because the table is too big and hits a session timeout?
> > >> > >> > > >
> > >> > >> > > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /hbase/.archive/rich_pin_data_v1/7713d5331180cb610834ba1c4ebbb9b3/d/eef3642f49244547bb6606d4d0f15f1f File does not exist. Holder DFSClient_NONMAPREDUCE_279781617_1 does not have any open files.
> > >> > >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
> > >> > >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
> > >> > >> > > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
> > >> > >> > > >         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
> > >> > >> > > >         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
> > >> > >> > > >         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
> > >> > >> > > >         at org.apache.hadoop.ipc.ProtobufR
> > >> > >> > > >
> > >> > >> > > >
> > >> > >> > > >
> > >> > >> > > > Thanks
> > >> > >> > > >
> > >> > >> > > > Tian-Ying
> > >> > >> > > >
> > >> > >> > >
> > >> > >> >
> > >> > >>
> > >> > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
