From: Arinto Murdopo
Date: Mon, 23 Feb 2015 15:47:34 +0800
Subject: HBase Region always in transition + corrupt HDFS
To: user@hbase.apache.org

Hi all,

We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop 2.0.0-cdh4.6.0). For all of our tables we set the replication factor to 1 (dfs.replication = 1 in hbase-site.xml). We chose 1 to minimize HDFS usage (we now realize it should be at least 2, because "failure is the norm" in distributed systems). Due to the amount of data we eventually ran low on disk space in HDFS, and one of our DataNodes went down. We have since recovered that DataNode.
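As an aside, once things are stable again we plan to raise the replication of the existing HBase files with something along these lines (assuming we understand 'hdfs dfs -setrep' correctly: it only changes files that already exist, new files still follow the dfs.replication value in the client config, and it obviously cannot bring back blocks that are already gone):

  # raise replication to 2 for everything already under /hbase and wait for it to complete
  hdfs dfs -setrep -w 2 /hbase

  # the second column of -ls output shows the per-file replication, so we can spot-check afterwards
  hdfs dfs -ls -R /hbase | head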
Even so, we are now left with the problems below in HBase and HDFS.

*Issue#1*. Some HBase regions are permanently stuck in transition, and 'hbase hbck -repair' hangs because it waits for the region transitions to finish. Sample output:

  hbase(main):003:0> status 'detailed'
  12 regionsInTransition
    plr_id_insta_media_live,\x02:;6;7;398962:3:399a49:653:64,1421565172917.1528f288473632aca2636443574a6ba1. state=OPENING, ts=1424227696897, server=null
    plr_sg_insta_media_live,\x0098;522:997;8798665a64;67879,1410768824800.2c79bbc5c0dc2d2b39c04c8abc0a90ff. state=OFFLINE, ts=1424227714203, server=null
    plr_sg_insta_media_live,\x00465892:9935773828;a4459;649,1410767723471.55097cfc60bc9f50303dadb02abcd64b. state=OPENING, ts=1424227701234, server=null
    plr_sg_insta_media_live,\x00474973488232837733a38744,1410767723471.740d6655afb74a2ff421c6ef16037f57. state=OPENING, ts=1424227708053, server=null
    plr_id_insta_media_live,\x02::449::4;:466;3988a6432677;3,1419435100617.7caf3d749dce37037eec9ccc29d272a1. state=OPENING, ts=1424227701484, server=null
    plr_sg_insta_media_live,\x05779793546323;::4:4a3:8227928,1418845792479.81c4da129ae5b7b204d5373d9e0fea3d. state=OPENING, ts=1424227705353, server=null
    plr_sg_insta_media_live,\x009;5:686348963:33:5a5634887,1410769837567.8a9ded24960a7787ca016e2073b24151. state=OPENING, ts=1424227706293, server=null
    plr_sg_insta_media_live,\x0375;6;7377578;84226a7663792,1418980694076.a1e1c98f646ee899010f19a9c693c67c. state=OPENING, ts=1424227680569, server=null
    plr_sg_insta_media_live,\x018;3826368274679364a3;;73457;,1421425643816.b04ffda1b2024bac09c9e6246fb7b183. state=OPENING, ts=1424227680538, server=null
    plr_sg_insta_media_live,\x0154752;22:43377542:a:86:239,1410771044924.c57d6b4d23f21d3e914a91721a99ce12. state=OPENING, ts=1424227710847, server=null
    plr_sg_insta_media_live,\x0069;7;9384697:;8685a885485:,1410767928822.c7b5e53cdd9e1007117bcaa199b30d1c. state=OPENING, ts=1424227700962, server=null
    plr_sg_insta_media_live,\x04994537646:78233569a3467:987;7,1410787903804.cd49ec64a0a417aa11949c2bc2d3df6e. state=OPENING, ts=1424227691774, server=null

*Issue#2*. The next step we took was to check the HDFS file status with 'hdfs fsck /'. It reports that the filesystem under '/' is corrupt, with these statistics:

   Total size:    15494284950796 B (Total open files size: 17179869184 B)
   Total dirs:    9198
   Total files:   124685 (Files currently being written: 21)
   Total blocks (validated):      219620 (avg. block size 70550427 B) (Total open file blocks (not validated): 144)
    ********************************
    CORRUPT FILES:        42
    MISSING BLOCKS:       142
    MISSING SIZE:         14899184084 B
    CORRUPT BLOCKS:       142
    ********************************
   Corrupt blocks:                142
   Number of data-nodes:          14
   Number of racks:               1
  FSCK ended at Tue Feb 17 17:25:18 SGT 2015 in 3026 milliseconds

  The filesystem under path '/' is CORRUPT

So it seems that HDFS lost some of its blocks when the DataNode failed, and because dfs.replication is 1 it cannot recover the missing blocks.
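To work out which of the 42 corrupt files are actually store files of the stuck regions, we assume we can use something like the following (the table path is just an example from our own layout under /hbase):

  # list every missing block together with the file it belongs to
  hdfs fsck / -list-corruptfileblocks

  # inspect one affected table in detail, block by block
  hdfs fsck /hbase/plr_id_insta_media_live -files -blocks -locations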
*Issue#3*. Although 'hbase hbck -repair' is stuck, we are able to run 'hbase hbck -fixHdfsHoles'. We see the following error messages (I copied one example of each type of error that we get):

  - ERROR: Region { meta => plr_id_insta_media_live,\x02:;6;7;398962:3:399a49:653:64,1421565172917.1528f288473632aca2636443574a6ba1., hdfs => hdfs://nameservice1/hbase/plr_id_insta_media_live/1528f288473632aca2636443574a6ba1, deployed => } not deployed on any region server.
  - ERROR: Region { meta => null, hdfs => hdfs://nameservice1/hbase/plr_sg_insta_media_live/8473d25be5980c169bff13cf90229939, deployed => } on HDFS, but not listed in META or deployed on any region server
  - ERROR: Region { meta => plr_sg_insta_media_live,\x0293:729769;975376;2a33995622;3,1421985489851.8819ebd296f075513056be4bbd30ee9c., hdfs => null, deployed => } found in META, but not in HDFS or deployed on any region server.
  - ERROR: There is a hole in the region chain between \x099599464:7:5;3595;8a:57868;95 and \x099;56535:4632439643a82826562:. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
  - ERROR: Last region should end with an empty key. You need to create a new region and regioninfo in HDFS to plug the hole.

To fix this, we plan to take the following actions:
1. Move or delete the corrupted files in HDFS.
2. Repair HBase by deleting the references to the corrupted files/blocks from the HBase meta table (it's okay to lose some of the data).
3. Or create empty HFiles as shown in http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/31308

Our questions are:
1. Is it safe to move or delete the corrupted files in HDFS? Can we make HBase ignore those files and delete the corresponding HBase files?
2. Any comments on our action items? (A rough command sketch is in the P.S. below.)

Best regards,

Arinto
www.otnira.com
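P.S. For action items 1 and 2, this is roughly the sequence we have in mind; the option names are the standard fsck/hbck ones in our CDH 4.6 install, so please tell us if any of them are unsafe in our situation:

  # move the corrupt files to /lost+found first (or use -delete to drop them outright)
  hdfs fsck / -move

  # then let hbck fix region assignments, clean up .META., and plug HDFS holes/orphans
  hbase hbck -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans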