Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hbase.apache.org
Received-SPF: pass (nike.apache.org: domain of zhoushuaifeng@gmail.com
 designates 209.85.217.172 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAK_cYZCJV+f9WyvXvr37ZyZuV43z=AZr5865pi1Z6KnsbBy1fg@mail.gmail.com>
References: 
 <CAK_cYZCJV+f9WyvXvr37ZyZuV43z=AZr5865pi1Z6KnsbBy1fg@mail.gmail.com>
Date: Thu, 4 Dec 2014 18:01:09 +0800
Message-ID: 
 <CAK_cYZDbVcfR-AXu16QxSBC7TDTsOmXAnJ66dxBrvERA-mh+7g@mail.gmail.com>
Subject: Re: split failed caused by FileNotFoundException
From: =?UTF-8?B?5ZGo5biF6ZSL?= <zhoushuaifeng@gmail.com>
To: dev <dev@hbase.apache.org>
Content-Type: multipart/alternative; boundary=001a11c23c601e02070509610a65

--001a11c23c601e02070509610a65
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

I rechecked the code in 0.98. There, this problem is handled by checking the
store object when the compaction runs and cancelling the compaction if the
store has been re-instantiated. From HRegion.compact:

      byte[] cf = Bytes.toBytes(store.getColumnFamilyName());
      if (stores.get(cf) != store) {
        LOG.warn("Store " + store.getColumnFamilyName() + " on region " + this
            + " has been re-instantiated, cancel this compaction request. "
            + " It may be caused by the roll back of split transaction");
        return false;
      }


But would it be better to replace the stale store object with the new one and
continue the compaction on that store, instead of cancelling it?
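
To make the question concrete, here is a minimal sketch of the two options.
SketchStore and SketchRegion are hypothetical stand-ins that I made up for
illustration, not the real HBase Store/HRegion classes, and the sketch assumes
the old file selection has to be thrown away and redone against the live store:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a store; NOT the HBase Store class.
final class SketchStore {
    final String family;
    final List<String> files;

    SketchStore(String family, List<String> files) {
        this.family = family;
        this.files = new ArrayList<>(files);
    }
}

// Hypothetical stand-in for a region; NOT the HBase HRegion class.
final class SketchRegion {
    // Live store per column family; the entry is replaced when a split rolls back.
    final Map<String, SketchStore> stores = new ConcurrentHashMap<>();

    // Current behaviour: refuse to compact through a stale store reference.
    boolean compactOrCancel(SketchStore requested) {
        if (stores.get(requested.family) != requested) {
            return false; // store was re-instantiated, cancel the request
        }
        return compactFiles(requested, new ArrayList<>(requested.files));
    }

    // Alternative: look up the live store and select the files again from it.
    boolean compactOnCurrentStore(SketchStore requested) {
        SketchStore current = stores.get(requested.family);
        if (current == null) {
            return false; // the column family no longer exists
        }
        // The queued selection may be outdated, so it is rebuilt from the live store.
        return compactFiles(current, new ArrayList<>(current.files));
    }

    private boolean compactFiles(SketchStore store, List<String> selection) {
        if (selection.size() < 2) {
            return false; // nothing worth compacting
        }
        store.files.removeAll(selection);
        store.files.add("compacted(" + String.join("+", selection) + ")");
        return true;
    }
}

class ReplaceStoreSketch {
    public static void main(String[] args) {
        SketchRegion region = new SketchRegion();
        List<String> onDisk = new ArrayList<>();
        onDisk.add("a");
        onDisk.add("b");
        SketchStore old = new SketchStore("cf", onDisk);
        region.stores.put("cf", old);
        // Split rollback re-instantiates the store for the same family.
        SketchStore fresh = new SketchStore("cf", onDisk);
        region.stores.put("cf", fresh);

        System.out.println("cancel variant ran:  " + region.compactOrCancel(old));
        System.out.println("replace variant ran: " + region.compactOnCurrentStore(old));
        System.out.println("live store files:    " + fresh.files);
    }
}
```

The cost of the second variant is that the files must be selected again, since
the list captured when the request was queued may no longer match the live
store.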


2014-12-04 15:00 GMT+08:00 周帅锋 <zhoushuaifeng@gmail.com>:

> In our HBase clusters, a split sometimes fails because a file to be split
> no longer exists in the parent region. In 0.94.2 this causes the
> regionserver to shut down, because the split transaction has already
> reached the PONR (point-of-no-return) state. In 0.94.20 and 0.98 the split
> fails but can be rolled back, because the split transaction has only
> reached the offlined_parent state.
>
> In 0.94.2, the error is like below:
> 2014-09-23 22:27:55,710 INFO org.apache.hadoop.hbase.catalog.MetaEditor:
> Offlined parent region xxxxx in META
> 2014-09-23 22:27:55,820 INFO
> org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup
> of failed split of xxxxx
> Caused by: java.io.IOException: java.io.IOException:
> java.io.FileNotFoundException: File does not exist: xxxxx
> Caused by: java.io.IOException: java.io.FileNotFoundException: File does
> not exist: xxxxx
> Caused by: java.io.FileNotFoundException: File does not exist: xxxxx
> 2014-09-23 22:27:55,823 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> xxx,60020,1411383568857: Abort; we got an error after point-of-no-return
>
> The reason for the missing files is a little complex. The whole sequence
> involves two failed splits and one compaction:
> 1) There are too many files in the region, so a compaction is requested,
> but it does not execute immediately because many CompactionRequests
> (compactionRunners) are already in the compaction queue. The queued
> CompactionRequest holds a reference to the Store object, and also holds the
> list of that store's storefiles to compact.
>
> 2) The region grows big enough and a split is requested. The region is
> taken offline during the split and the store is closed, but the split fails
> while splitting the files (for some reason such as busy I/O causing a
> timeout):
> 2014-09-23 18:26:02,738 INFO
> org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup
> of failed split of xxxxx; Took too long to split the files and create the
> references, aborting split
>
> 3) The split rolls back successfully and the region comes online again.
> During the rollback a new Store object is created, but the request holding
> the old Store is not removed from the compaction queue, so there are now
> two (or maybe more) Store objects for the same store in the regionserver.
>
> 4) The compaction requested earlier now runs on the old Store object. Some
> storefiles are compacted and removed, and new, bigger storefiles are
> created, but the Store that was re-initialized during the split rollback
> does not know about these changes to the storefiles.
>
> 5) A split of the region runs again and fails again, because the storefiles
> it expects in the parent region no longer exist (they were removed by the
> compaction). The split transaction also does not know about the new file
> created by the compaction.
> In 0.94.2 this error is not detected until a daughter region is opened,
> which is too late: the split fails after PONR, and this causes the
> regionserver to shut down. In 0.94.20 and 0.98, splitStoreFiles reads the
> storefiles in the parent region and so finds the error before PONR, which
> means the failed split can be rolled back.
>      code in HRegionFileSystem.splitStoreFile:
>      ...
>      byte[] lastKey = f.createReader().getLastKey();
>
> So this situation is a fatal error in early 0.94 versions, and still a
> common bug in later 0.94 and higher versions. It is also the reason why the
> storefile reader is sometimes null (it was closed by the first failed
> split).
>
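The five steps quoted above can be replayed in a small self-contained
simulation. This is only a sketch under stated assumptions: DemoStore,
StaleStoreRace and the file names f1..f4 are hypothetical, the "disk" is just
an in-memory set, and none of this is HBase code. The guard flag models the
identity check from HRegion.compact shown earlier in this mail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a store: just the file list it believes it has.
final class DemoStore {
    final List<String> files;

    DemoStore(Iterable<String> files) {
        this.files = new ArrayList<>();
        for (String f : files) {
            this.files.add(f);
        }
    }
}

final class StaleStoreRace {
    // Replays steps 1-5 and returns the files the second split cannot find.
    static List<String> run(boolean guardAgainstStaleStore) {
        Set<String> disk = new TreeSet<>(Arrays.asList("f1", "f2", "f3"));

        // 1) A compaction is queued; the request keeps the Store and a file list.
        DemoStore queuedStore = new DemoStore(disk);
        List<String> queuedSelection = new ArrayList<>(queuedStore.files);

        // 2) The first split closes the store and then fails (timeout).
        // 3) Rollback re-opens the region with a NEW store built from disk.
        DemoStore liveStore = new DemoStore(disk);

        // 4) The queued compaction finally runs, still holding the old store.
        boolean stale = queuedStore != liveStore;
        if (!(guardAgainstStaleStore && stale)) {
            disk.removeAll(queuedSelection);
            disk.add("f4");
            queuedStore.files.clear();
            queuedStore.files.add("f4"); // only the old store learns about f4
        }

        // 5) The second split walks the live store's file list.
        List<String> missing = new ArrayList<>();
        for (String f : liveStore.files) {
            if (!disk.contains(f)) {
                missing.add(f); // this is where FileNotFoundException would surface
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println("missing without guard: " + run(false));
        System.out.println("missing with guard:    " + run(true));
    }
}
```

Without the guard the live store still lists files that the stale compaction
deleted; with the guard the stale request is dropped and the second split
finds every file it expects.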

--001a11c23c601e02070509610a65--