Subject: Re: export snapshot fail sometime due to LeaseExpiredException
From: Tianying Chang <tychang@gmail.com>
To: user@hbase.apache.org
Date: Wed, 30 Apr 2014 21:18:35 -0700

Hi,

I found that only the cluster I was using for my test has this issue. When I changed the destination cluster to another one, the problem went away. I still don't know what is special about that cluster that causes the job to fail sometimes, especially on the 2nd, 3rd... run. But at least I know ExportSnapshot is stable. Thanks a lot for your help.

Tian-Ying

On Wed, Apr 30, 2014 at 3:43 PM, Ted Yu wrote:

> bq. 1. delete_snapshot 'myTable'
>
> myTable is a table, not the name of a snapshot, right?
>
> HBASE-10766 was not among the list of patches in your earlier email. Can you apply the patch and try again?
>
> Cheers
>
> On Wed, Apr 30, 2014 at 3:31 PM, Tianying Chang wrote:
>
> Actually, my test on a 90G table always succeeds and never fails. The failing one is a production table of about 400G with 460 regions.
> The weird thing is that the first run after I refresh the jar (either the throttled or the non-throttled one) always succeeds with no failed tasks, but the 2nd, 3rd... runs always fail. The error message says the destination file does not exist, yet that is the very file the task is trying to copy into, which is strange.
>
> BTW, I clean up the destination cluster by doing three things:
> 1. delete_snapshot 'myTable'
> 2. hadoop dfs -rmr /hbase/.hbase-snapshot/.tmp
> 3. hadoop dfs -rmr /hbase/.archive/myTable
>
> Thanks
> Tian-Ying
>
> On Wed, Apr 30, 2014 at 3:07 PM, Matteo Bertozzi <theo.bertozzi@gmail.com> wrote:
>
> Can you post your ExportSnapshot.java code?
> Is your destination an hbase cluster? If yes, do you have HBASE-10766? If not, try to export to an hdfs path (not a /hbase subdir).
> Do you have other stuff playing with the files in .archive, or multiple ExportSnapshot jobs running against the same set of files?
>
> We have tested ExportSnapshot with 40G files, so the problem is not the size. It may be one of the above, or your lease timeout may be too low for the "busy" state of your machines.
>
> Matteo
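(For reference, an export to a plain hdfs path of the kind Matteo suggests above would be run roughly as follows. This is a minimal sketch: the snapshot name, destination NameNode and mapper count are placeholders rather than values from this thread, and the -bandwidth option is only present with the HBASE-11083 backport discussed further down.)

# Minimal sketch: export the snapshot to a staging path on the destination
# cluster instead of the live /hbase tree. Snapshot name, NameNode host and
# mapper count below are placeholders, not values from this thread.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot rich_pin_data_v1_snapshot \
  -copy-to hdfs://dest-nn:8020/snapshot-staging \
  -mappers 16 \
  -bandwidth 200   # the 200M cap discussed in this thread; requires the HBASE-11083 backport

The point of the non-/hbase destination, as Matteo notes, is to keep the copy out of directories that the destination HBase cluster manages itself.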
> On Wed, Apr 30, 2014 at 2:55 PM, Tianying Chang wrote:
>
> I think it is not directly caused by the throttle. On the 2nd run with the non-throttled jar, the LeaseExpiredException shows up again (for a big file). So it does seem like ExportSnapshot is not reliable for big files.
>
> The weird thing is that when I replace the jar and restart the cluster, the first run on the big table always succeeds, but the later runs always fail with these LeaseExpiredExceptions. A smaller table has no problem no matter how many times I re-run.
>
> Thanks
> Tian-Ying
>
> On Wed, Apr 30, 2014 at 2:24 PM, Tianying Chang wrote:
>
> Ted,
>
> It seems it is due to HBASE-11083 (throttle bandwidth during snapshot export). After I reverted it, the job succeeded again. Even when I set the throttle bandwidth high, like 200M, iftop shows a much lower value. Maybe the throttle is sleeping longer than it is supposed to? But I am not clear why a slow copy job can cause a LeaseExpiredException. Any idea?
>
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on /hbase/.archive/rich_pin_data_v1/b50ab10bb4812acc2e9fa6c564c9adef/d/bac3c661a897466aaf1706a9e1bd9e9a
> File does not exist. Holder DFSClient_NONMAPREDUCE_-2096088484_1 does not have any open files.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:2454)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2431)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:536)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:335)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$
>
> Thanks
> Tian-Ying
>
> On Wed, Apr 30, 2014 at 1:25 PM, Ted Yu wrote:
>
> Tianying:
> Have you checked the audit log on the namenode for a deletion event corresponding to the files involved in the LeaseExpiredException?
>
> Cheers
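(A concrete way to act on Ted's suggestion above is to search the NameNode audit log for a delete of the exact path named in the exception around the time of the failure. A rough sketch; the audit log location shown is a placeholder and differs per installation.)

# Minimal sketch: look for a deletion of the file named in the exception.
# The log path is a placeholder; adjust it to wherever the NameNode audit
# log lives on your installation.
grep 'cmd=delete' /var/log/hadoop-hdfs/hdfs-audit.log \
  | grep 'rich_pin_data_v1/b50ab10bb4812acc2e9fa6c564c9adef/d/bac3c661a897466aaf1706a9e1bd9e9a'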
> On Wed, Apr 30, 2014 at 10:44 AM, Tianying Chang <tychang@gmail.com> wrote:
>
> This time the re-run passed (although with many failed/retried tasks) with the throttle bandwidth set to 200M (although by iftop it never gets close to that number). Is there a way to increase the lease expiry time for an individual export job running with a low throttle bandwidth?
>
> Thanks
> Tian-Ying
>
> On Wed, Apr 30, 2014 at 10:17 AM, Tianying Chang wrote:
>
> Yes, I am using the bandwidth throttle feature. The export job for this table actually succeeded on its first run. When I rerun it (for my robustness testing) it seems to never pass. I am wondering if it has some weird state (I did clean up the target cluster and even removed the /hbase/.archive/rich_pin_data_v1 folder).
>
> It seems that even if I set the throttle value really large, it still fails. And I think even after I replace the jar with the one without the throttle, it still fails on re-runs.
>
> Is there some way that I can increase the lease to be very large to test it out?
>
> On Wed, Apr 30, 2014 at 10:02 AM, Matteo Bertozzi <theo.bertozzi@gmail.com> wrote:
>
> The file is the file in the export, so you are creating that file. Do you have the bandwidth throttle on?
>
> I'm thinking that the file is being written slowly, e.g. write(few bytes), wait, write(few bytes), and during one of the waits your lease expires. Something like that can also happen if your MR job is stuck in some way (a slow machine or similar) and is not writing within the lease timeout.
>
> Matteo
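(Related to Matteo's earlier question about other things touching .archive, and to the "File does not exist" wording of the exceptions: one simple check is to watch the destination archive directory while the export runs and see whether the partially written HFile disappears underneath the copy. A rough sketch, reusing the region and file names from the stack traces in this thread.)

# Minimal sketch: on the destination cluster, poll the archive directory for
# the table while the export is running, to see whether in-flight HFiles are
# being removed (for example by a cleaner chore on the destination).
while true; do
  date
  hadoop dfs -ls /hbase/.archive/rich_pin_data_v1/b50ab10bb4812acc2e9fa6c564c9adef/d/
  sleep 30
done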
> On Wed, Apr 30, 2014 at 9:53 AM, Tianying Chang wrote:
>
> We are using Hadoop 2.0.0-cdh4.2.0 and hbase 0.94.7. We also backported several snapshot-related jiras, e.g. HBASE-10111 (verify snapshot) and HBASE-11083 (bandwidth throttle in ExportSnapshot).
>
> I found that when the LeaseExpiredException was first reported, the file was indeed not there, and the map task retried. I verified a couple of minutes later that the HFile does exist under /.archive. But the retried map task still complains with the same file-does-not-exist error...
>
> I will check the namenode log for the LeaseExpiredException.
>
> Thanks
> Tian-Ying
>
> On Wed, Apr 30, 2014 at 9:33 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> Can you give us the hbase and hadoop releases you're using?
>
> Can you check the namenode log around the time the LeaseExpiredException was encountered?
>
> Cheers
>
> On Wed, Apr 30, 2014 at 9:20 AM, Tianying Chang <tychang@gmail.com> wrote:
>
> Hi,
>
> When I export a large table with 460+ regions, I see the exportSnapshot job fail sometimes (not all the time). The error of the map task is below, but I verified that the file highlighted below does exist. A smaller table seems to always pass. Any idea? Is it because it is too big and gets a session timeout?
>
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on /hbase/.archive/rich_pin_data_v1/7713d5331180cb610834ba1c4ebbb9b3/d/eef3642f49244547bb6606d4d0f15f1f
> File does not exist. Holder DFSClient_NONMAPREDUCE_279781617_1 does not have any open files.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
>     at org.apache.hadoop.ipc.ProtobufR
>
> Thanks
>
> Tian-Ying