Subject: Re: export snapshot fail sometime due to LeaseExpiredException
From: Tianying Chang <tychang@gmail.com>
To: user@hbase.apache.org
Date: Wed, 30 Apr 2014 15:31:51 -0700

Actually, my testing on a 90G table always succeeds and never fails. The one that fails is a production table of about 400G with 460 regions.

The weird thing is that the first run after I refresh the jar (either the throttle or the non-throttle build) always succeeds with no failed tasks, but the 2nd, 3rd, ... runs always fail. And the error message says the destination file does not exist, which is very strange, since that is the file the task is trying to copy into.

BTW, I clean up the destination cluster by doing three things:

1. delete_snapshot 'myTable'
2. hadoop dfs -rmr /hbase/.hbase-snapshot/.tmp
3. hadoop dfs -rmr /hbase/.archive/myTable

Thanks
Tian-Ying
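[Aside: an export like the one discussed in this thread is typically kicked off along these lines; the snapshot name, destination cluster, and numbers below are only placeholders, and -bandwidth is the MB/s cap added by the backported HBASE-11083:

    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
        -snapshot myTable-snapshot \
        -copy-to hdfs://dest-cluster:8020/hbase \
        -mappers 16 \
        -bandwidth 200

With -bandwidth set, the copy is throttled to roughly that many MB per second.]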
On Wed, Apr 30, 2014 at 3:07 PM, Matteo Bertozzi wrote:

> Can you post your ExportSnapshot.java code?
> Is your destination an hbase cluster? If yes, do you have HBASE-10766? If
> not, try to export to an hdfs path (not a /hbase subdir).
> Do you have other stuff playing with the files in .archive, or multiple
> ExportSnapshot jobs running against the same set of files?
>
> We have tested ExportSnapshot with 40G files, so the problem is not the
> size. It may be one of the above, or your lease timeout may be too low for
> the "busy" state of your machines.
>
> Matteo
>
> On Wed, Apr 30, 2014 at 2:55 PM, Tianying Chang wrote:
>
> > I think it is not directly caused by the throttle. On the 2nd run with
> > the non-throttle jar, the LeaseExpiredException shows up again (for the
> > big file). So it does seem like ExportSnapshot is not reliable for big
> > files.
> >
> > The weird thing is that when I replace the jar and restart the cluster,
> > the first run of the big table always succeeds, but the later runs always
> > fail with these LeaseExpiredExceptions. The smaller table has no problem
> > no matter how many times I re-run.
> >
> > Thanks
> > Tian-Ying
> >
> > On Wed, Apr 30, 2014 at 2:24 PM, Tianying Chang wrote:
> >
> > > Ted,
> > >
> > > It seems it is due to HBASE-11083 (throttle bandwidth during snapshot
> > > export). After I reverted it, the job succeeded again. It seems that
> > > even when I set the throttle bandwidth high, like 200M, iftop shows a
> > > much lower value. Maybe the throttle is sleeping longer than it is
> > > supposed to? But I am not clear why a slow copy job can cause a
> > > LeaseExpiredException. Any idea?
> > >
> > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> > > No lease on
> > > /hbase/.archive/rich_pin_data_v1/b50ab10bb4812acc2e9fa6c564c9adef/d/bac3c661a897466aaf1706a9e1bd9e9a
> > > File does not exist. Holder DFSClient_NONMAPREDUCE_-2096088484_1 does
> > > not have any open files.
> > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
> > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
> > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:2454)
> > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2431)
> > >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:536)
> > >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:335)
> > >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$
> > >
> > > Thanks
> > > Tian-Ying
> > >
> > > On Wed, Apr 30, 2014 at 1:25 PM, Ted Yu wrote:
> > >
> > > > Tianying:
> > > > Have you checked the audit log on the namenode for a deletion event
> > > > corresponding to the files involved in the LeaseExpiredException?
> > > >
> > > > Cheers
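[Aside: a concrete way to do the audit-log check Ted suggests above is to grep the HDFS audit log on the destination namenode for delete events on the archived HFile; the log path below is only a guess for a CDH4 install, and the file name is the one from the trace above:

    grep 'cmd=delete' /var/log/hadoop-hdfs/hdfs-audit.log | grep bac3c661a897466aaf1706a9e1bd9e9a

A hit around the failure time would mean something (for example a cleaner job) removed the file while the export was still writing or verifying it.]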
> > > > On Wed, Apr 30, 2014 at 10:44 AM, Tianying Chang wrote:
> > > >
> > > > > This time the re-run passed (although with many failed/retried
> > > > > tasks) with my throttle bandwidth set to 200M (although by iftop,
> > > > > it never gets close to that number). Is there a way to increase the
> > > > > lease expiry time for a low throttle bandwidth, for an individual
> > > > > export job?
> > > > >
> > > > > Thanks
> > > > > Tian-Ying
> > > > >
> > > > > On Wed, Apr 30, 2014 at 10:17 AM, Tianying Chang wrote:
> > > > >
> > > > > > Yes, I am using the bandwidth throttle feature. The export job of
> > > > > > this table actually succeeded on its first run. When I rerun it
> > > > > > (for my robustness testing) it seems to never pass. I am wondering
> > > > > > if it has some weird state (I did clean up the target cluster and
> > > > > > even removed the /hbase/.archive/rich_pin_data_v1 folder).
> > > > > >
> > > > > > It seems that even if I set the throttle value really large, it
> > > > > > still fails. And I think even after I replace the jar back with
> > > > > > the one without the throttle, it still fails on re-run.
> > > > > >
> > > > > > Is there some way that I can increase the lease to be very large
> > > > > > to test it out?
> > > > > >
> > > > > > On Wed, Apr 30, 2014 at 10:02 AM, Matteo Bertozzi wrote:
> > > > > >
> > > > > > > The file is the file in the export, so you are creating that
> > > > > > > file. Do you have the bandwidth throttle on?
> > > > > > >
> > > > > > > I'm thinking that the file is being written slowly: e.g.
> > > > > > > write(few bytes), wait, write(few bytes), and during the wait
> > > > > > > your lease expires. Something like that can also happen if your
> > > > > > > MR job is stuck in some way (slow machine or similar) and is
> > > > > > > not writing within the lease timeout.
> > > > > > >
> > > > > > > Matteo
> > > > > > >
> > > > > > > On Wed, Apr 30, 2014 at 9:53 AM, Tianying Chang wrote:
> > > > > > >
> > > > > > > > We are using Hadoop 2.0.0-cdh4.2.0 and HBase 0.94.7. We also
> > > > > > > > backported several snapshot-related jiras, e.g. HBASE-10111
> > > > > > > > (verify snapshot) and HBASE-11083 (bandwidth throttle in
> > > > > > > > ExportSnapshot).
> > > > > > > >
> > > > > > > > I found that when the LeaseExpiredException was first
> > > > > > > > reported, the file was indeed not there, and the map task
> > > > > > > > retried. And I verified a couple of minutes later that the
> > > > > > > > HFile does exist under /.archive. But the retried map task
> > > > > > > > still complains with the same file-does-not-exist error...
> > > > > > > >
> > > > > > > > I will check the namenode log for the LeaseExpiredException.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Tian-Ying
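[Aside: a quick way to re-check the observation above, that the HFile eventually shows up under /.archive, is to list the archive directory on the destination cluster; the region directory below is the one from the earlier trace, and the path layout assumes the pre-0.96 /hbase/.archive location used in this thread:

    hadoop fs -ls /hbase/.archive/rich_pin_data_v1/b50ab10bb4812acc2e9fa6c564c9adef/d/

If the HFile is listed there shortly after the failure, that matches what Tianying describes.]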
> > > > > > > > On Wed, Apr 30, 2014 at 9:33 AM, Ted Yu wrote:
> > > > > > > >
> > > > > > > > > Can you give us the hbase and hadoop releases you're using?
> > > > > > > > >
> > > > > > > > > Can you check the namenode log around the time the
> > > > > > > > > LeaseExpiredException was encountered?
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > >
> > > > > > > > > On Wed, Apr 30, 2014 at 9:20 AM, Tianying Chang wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > When I export a large table with 460+ regions, I see the
> > > > > > > > > > exportSnapshot job fail sometimes (not all the time). The
> > > > > > > > > > error of the map task is below. But I verified the file
> > > > > > > > > > highlighted below, and it does exist. A smaller table
> > > > > > > > > > seems to always pass. Any idea? Is it because it is too
> > > > > > > > > > big and gets a session timeout?
> > > > > > > > > >
> > > > > > > > > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> > > > > > > > > > No lease on
> > > > > > > > > > /hbase/.archive/rich_pin_data_v1/7713d5331180cb610834ba1c4ebbb9b3/d/eef3642f49244547bb6606d4d0f15f1f
> > > > > > > > > > File does not exist. Holder DFSClient_NONMAPREDUCE_279781617_1 does
> > > > > > > > > > not have any open files.
> > > > > > > > > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
> > > > > > > > > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
> > > > > > > > > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
> > > > > > > > > >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
> > > > > > > > > >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
> > > > > > > > > >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
> > > > > > > > > >     at org.apache.hadoop.ipc.ProtobufR
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > >
> > > > > > > > > > Tian-Ying
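[Aside: the namenode-log check Ted asks for above can be done by grepping the namenode daemon log for the HFile name or the lease holder from the trace; the log path below is only a guess for a CDH4 install:

    grep eef3642f49244547bb6606d4d0f15f1f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-*.log
    grep DFSClient_NONMAPREDUCE_279781617_1 /var/log/hadoop-hdfs/hadoop-hdfs-namenode-*.log

Lease expiry or recovery messages for that holder just before the addBlock failure would support the slow-writer theory discussed earlier in the thread.]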