Subject: Re: split failed caused by FileNotFoundException
From: 周帅锋
To: dev
Date: Thu, 4 Dec 2014 18:01:09 +0800

I rechecked the code in 0.98. This problem is solved there by checking the
store object in the compaction runner and cancelling the compaction if the
store has been re-instantiated.

HRegion.compact:

    byte[] cf = Bytes.toBytes(store.getColumnFamilyName());
    if (stores.get(cf) != store) {
      LOG.warn("Store " + store.getColumnFamilyName() + " on region " + this
          + " has been re-instantiated, cancel this compaction request. "
          + " It may be caused by the roll back of split transaction");
      return false;
    }

But would it be better to replace the store object with the new one and
continue the compaction on that store, instead of cancelling?

2014-12-04 15:00 GMT+08:00 周帅锋 :

> In our HBase clusters, a split sometimes fails because a file to be split
> does not exist in the parent region. In 0.94.2, this causes a regionserver
> shutdown, because the split transaction has already reached the PONR
> (point-of-no-return) state. In 0.94.20 or 0.98, the split fails but can be
> rolled back, because the split transaction has only reached the
> offlined_parent state.
>
> In 0.94.2, the error looks like this:
>
> 2014-09-23 22:27:55,710 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region xxxxx in META
> 2014-09-23 22:27:55,820 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of xxxxx
> Caused by: java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: xxxxx
> Caused by: java.io.IOException: java.io.FileNotFoundException: File does not exist: xxxxx
> Caused by: java.io.FileNotFoundException: File does not exist: xxxxx
> 2014-09-23 22:27:55,823 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server xxx,60020,1411383568857: Abort; we got an error after point-of-no-return
>
> The reason for the missing files is a little complex. The whole procedure
> involves two failed splits and one compaction:
>
> 1) There are too many files in the region, so a compaction is requested,
> but it is not executed right away because there are many
> CompactionRequests (compactionRunners) in the compaction queue. The
> CompactionRequest holds the Store object, and also holds a list of that
> store's storefiles to compact.
>
> 2) The region becomes big enough that a split is requested. The region is
> offline during the split and the store is closed, but the split fails
> while splitting the files (for some reason, such as busy I/O, causing a
> timeout):
> 2014-09-23 18:26:02,738 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of xxxxx; Took too long to split the files and create the references, aborting split
>
> 3) The split rolls back successfully and the region is online again.
> During the rollback, a new Store object is created, but the request
> holding the old Store object is not removed from the compaction queue, so
> there are two (or maybe more) Store objects for the same store in the
> regionserver.
>
> 4) The compaction requested earlier finally runs against the old Store
> object. Some storefiles are compacted and removed, and new, bigger
> storefiles are created. But the Store that was re-initialized during the
> split rollback does not know about these storefile changes.
>
> 5) A split of the region runs again and fails again, because the
> storefiles it expects in the parent region no longer exist (they were
> removed by the compaction). The split transaction also does not know that
> the compaction created a new file. In 0.94.2, this error is not detected
> until the daughter regions are opened, which is too late: the split fails
> after the PONR, and this causes a regionserver shutdown. In 0.94.20 and
> 0.98, splitStoreFiles looks into the storefiles in the parent region and
> finds the error before the PONR, so the failed split can be rolled back.
> Code in HRegionFileSystem.splitStoreFile:
>     ...
>     byte[] lastKey = f.createReader().getLastKey();
>
> So this situation is a fatal error in early 0.94 versions, and is still a
> common bug in later 0.94 and higher versions. It is also the reason why
> the storefile reader is sometimes null (closed by the first failed
> split).
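
To make the two options from the question above concrete (cancel the stale
request, or rebind it to the re-instantiated store), here is a minimal,
self-contained Java sketch. It is only a toy model under stated assumptions,
not HBase code: Store, CompactionRequest, Region, rollbackSplit,
compactOrCancel and compactOnLiveStore are all hypothetical stand-ins, and
file names are plain strings. Option A mirrors the identity check quoted from
HRegion.compact; option B is one possible reading of "replace the store
object and continue".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the HBase classes.
class StaleStoreDemo {

    /** Toy store: a column family name plus the storefiles it currently tracks. */
    static final class Store {
        final String family;
        final List<String> files;

        Store(String family, List<String> files) {
            this.family = family;
            this.files = new ArrayList<>(files);
        }
    }

    /** Toy request: captures the Store object and its file list at request time. */
    static final class CompactionRequest {
        final Store store;
        final List<String> filesToCompact;

        CompactionRequest(Store store) {
            this.store = store;
            this.filesToCompact = new ArrayList<>(store.files);
        }
    }

    /** Toy region: maps a family name to the currently live Store object. */
    static final class Region {
        final Map<String, Store> stores = new HashMap<>();

        void open(String family, List<String> files) {
            stores.put(family, new Store(family, files));
        }

        /** Models a split rollback: the store is re-instantiated over the same files. */
        void rollbackSplit(String family) {
            Store old = stores.get(family); // assumes the family was opened before
            stores.put(family, new Store(family, old.files));
        }

        /** Option A: cancel the request if its Store is no longer the live one. */
        boolean compactOrCancel(CompactionRequest req) {
            if (stores.get(req.store.family) != req.store) {
                return false; // stale request, nothing is touched
            }
            doCompact(req.store, req.filesToCompact);
            return true;
        }

        /** Option B: rebind to the live Store and reselect files from it. */
        boolean compactOnLiveStore(CompactionRequest req) {
            Store live = stores.get(req.store.family);
            if (live == null) {
                return false; // the family no longer exists in this region
            }
            // The file list captured at request time may be stale, so it is
            // re-read from the live store instead of being reused.
            doCompact(live, new ArrayList<>(live.files));
            return true;
        }

        /** Replaces the input files with a single merged file in the given store. */
        private static void doCompact(Store store, List<String> inputs) {
            store.files.removeAll(inputs);
            store.files.add("compacted(" + String.join("+", inputs) + ")");
        }
    }

    public static void main(String[] args) {
        Region region = new Region();
        region.open("cf", Arrays.asList("f1", "f2", "f3"));
        CompactionRequest queued = new CompactionRequest(region.stores.get("cf"));
        region.rollbackSplit("cf"); // the queued request now points at a stale Store

        System.out.println("option A compacted: " + region.compactOrCancel(queued));
        System.out.println("option B compacted: " + region.compactOnLiveStore(queued));
        System.out.println("live files: " + region.stores.get("cf").files);
    }
}
```

In this model, option B is only safe because the file list is reselected
from the live store. Reusing the list captured at request time would bring
back the mismatch described in steps 4) and 5). Whether a real
implementation could reselect files at that point is an open question that
this sketch does not answer.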