Return-Path: X-Original-To: apmail-hbase-issues-archive@www.apache.org Delivered-To: apmail-hbase-issues-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id B4DA018336 for ; Mon, 8 Jun 2015 18:26:02 +0000 (UTC) Received: (qmail 63265 invoked by uid 500); 8 Jun 2015 18:26:02 -0000 Delivered-To: apmail-hbase-issues-archive@hbase.apache.org Received: (qmail 63219 invoked by uid 500); 8 Jun 2015 18:26:02 -0000 Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list issues@hbase.apache.org Received: (qmail 63205 invoked by uid 99); 8 Jun 2015 18:26:02 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 08 Jun 2015 18:26:02 +0000 Date: Mon, 8 Jun 2015 18:26:02 +0000 (UTC) From: "Enis Soztutar (JIRA)" To: issues@hbase.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HBASE-13853) ITBLL data loss in HBase-1.1 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HBASE-13853?page=3Dcom.atlassia= n.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D145= 77603#comment-14577603 ]=20 Enis Soztutar commented on HBASE-13853: --------------------------------------- bq. I extracted from Enis patch stuff that was not addressed by HBASE-13811= =E2=80=93 the changes to ITBLL and extra info output in WALSplitter. This = at least should go in. Can do in a new issue or here. Sounds good. Lets do it here. We can change the issue title.=20 > ITBLL data loss in HBase-1.1 > ---------------------------- > > Key: HBASE-13853 > URL: https://issues.apache.org/jira/browse/HBASE-13853 > Project: HBase > Issue Type: Bug > Reporter: Enis Soztutar > Assignee: Enis Soztutar > Priority: Blocker > Fix For: 2.0.0, 1.1.1 > > Attachments: 13853.remainder.txt, hbase-13853_v1-branch-1.1.patch > > > After 2 days of log mining with ITBLL Search tool (thanks Stack, BTW for = the tool), the sequence of events are like this:=20 > One of mappers from Search reportedly finds missing keys in some WAL file= : > {code} > 2015-06-07 06:49:24,573 INFO [main] org.apache.hadoop.mapred.MapTask: Pro= cessing split: hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hba= se/data/oldWALs/os-enis-dal-test-jun-4-4.openstacklocal%2C16020%2C143363632= 0201.default.1433637872643 (-9223372036854775808:9223372036854775807) lengt= h:132679657 2015-06-07 06:49:24,639 INFO [main] org.apache.hadoop.hbase.map= reduce.WALInputFormat: Opening reader for hdfs://os-enis-dal-test-jun-4-1.o= penstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-4.opensta= cklocal%2C16020%2C1433636320201.default.1433637872643 (-9223372036854775808= :9223372036854775807) length:132679657 > 2015-06-07 06:50:16,383 INFO [main] org.apache.hadoop.hbase.test.Integrat= ionTestBigLinkedList$Search: Loaded keys to find: count=3D2384870 2015-06-0= 7 06:50:16,607 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigL= inkedList$Search: Found cell=3D!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0= D\xBC/meta:prev/1433637873293/Put/vlen=3D16/seqid=3D0 2015-06-07 06:50:16,6= 63 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Se= arch: Found cell=3D!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC/meta:c= ount/1433637873293/Put/vlen=3D8/seqid=3D0 > 2015-06-07 06:50:16,664 INFO [main] org.apache.hadoop.hbase.test.Integrat= ionTestBigLinkedList$Search: Found cell=3D!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C= \xDA\x1D\x0D\xBC/meta:client/1433637873293/Put/vlen=3D71/seqid=3D0 2015-06-= 07 06:50:16,671 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBig= LinkedList$Search: Found cell=3D$\x1A\x99\x06\x86\xE7\x07\xA2\xA7\xB2\xFB\x= CEP\x12"\x04/meta:prev/1433637873293/Put/vlen=3D16/seqid=3D0 > 2015-06-07 06:50:16,672 INFO [main] org.apache.hadoop.hbase.test.Integrat= ionTestBigLinkedList$Search: Found cell=3D$\x1A\x99\x06\x86\xE7\x07\xA2\xA7= \xB2\xFB\xCEP\x12"\x04/meta:count/1433637873293/Put/vlen=3D8/seqid=3D0 2015= -06-07 06:50:16,678 INFO [main] org.apache.hadoop.hbase.test.IntegrationTes= tBigLinkedList$Search: Found cell=3D$\x1A\x99\x06\x86\xE7\x07\xA2\xA7\xB2\x= FB\xCEP\x12"\x04/meta:client/1433637873293/Put/vlen=3D71/seqid=3D0 > 2015-06-07 06:50:16,679 INFO [main] org.apache.hadoop.hbase.test.Integrat= ionTestBigLinkedList$Search: Found cell=3D\x1F\x80#\x95\xAE:i=3D)S\x01\x08 = \xD6\xFF\xD5/meta:prev/1433637873293/Put/vlen=3D16/seqid=3D0 > {code} > {{hlog -p}} confirms the missing keys are there in the file:=20 > {code} > Sequence=3D7276 , region=3Db086e29f909c9790446cd457c1ea7674 at write time= stamp=3DSun Jun 07 00:44:33 UTC 2015 > row=3D!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=3Dmeta:pr= ev > value: \x1B\xF5\xF3^\x8D\xB4\x85\xE3\xF4wS\x9A]\x0D\xABe > row=3D!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=3Dmeta:co= unt > value: \x00\x00\x00\x00\x002\x87l > row=3D!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=3Dmeta:cl= ient > value: Job: job_1433466891829_0002 Task: attempt_1433466891829_0002_m= _000003_0 > {code} > When the RS gets killing from CM, master does WAL splitting:=20 > {code} > ./hbase-hbase-regionserver-os-enis-dal-test-jun-4-2.log:2015-06-07 00:46:= 12,581 INFO [RS_LOG_REPLAY_OPS-os-enis-dal-test-jun-4-2:16020-0] wal.WALSp= litter: Processed 2971 edits across 4 regions; edits skipped=3D740; log fil= e=3Dhdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WAL= s/os-enis-dal-test-jun-4-4.openstacklocal,16020,1433636320201-splitting/os-= enis-dal-test-jun-4-4.openstacklocal%2C16020%2C1433636320201.default.143363= 7872643, length=3D132679657, corrupted=3Dfalse, progress failed=3Dfalse > {code} > The edits with Sequence=3D7276 should be in this recovered.edits file:=20 > {code} > 2015-06-07 00:46:12,574 INFO [split-log-closeStream-2] wal.WALSplitter: = Closed wap hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/d= ata/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7= 674/recovered.edits/0000000000000007276.temp (wrote 739 edits in 3950ms) > 2015-06-07 00:46:12,580 INFO [split-log-closeStream-2] wal.WALSplitter: = Rename hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/= data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674/= recovered.edits/0000000000000007276.temp to hdfs://os-enis-dal-test-jun-4-1= .openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedL= ist/b086e29f909c9790446cd457c1ea7674/recovered.edits/0000000000000008752 > {code} > Notice that the file {{0000000000000008752}} should contain {{(wrote 739 = edits in 3950ms)}}.=20 > When the region gets opened again, it replays the recovered.edits files: > {code} > 2015-06-07 00:46:17,764 DEBUG [RS_OPEN_REGION-os-enis-dal-test-jun-4-5:16= 020-0] regionserver.HRegion: Found 4 recovered edits file(s) under hdfs://o= s-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/In= tegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674 > {code} > However, it does not see ANY entry in the recovered.edits file {{0000000= 000000008752}}:=20 > {code} > 2015-06-07 00:46:17,796 DEBUG [RS_OPEN_REGION-os-enis-dal-test-jun-4-5:16= 020-0] regionserver.HRegion: Applied 0, skipped 0, firstSequenceIdInLog=3D-= 1, maxSequenceIdInLog=3D-1, path=3Dhdfs://os-enis-dal-test-jun-4-1.openstac= klocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e= 29f909c9790446cd457c1ea7674/recovered.edits/0000000000000008752 > {code} > Looking at the length of the file, it is very small which means that the = edits are not written.=20 > Now, If you are still with me, I think the reason why we do not write any= edit to {{0000000000000008752}} although we say we have written 739 edits = is due to HBASE-12405.=20 > Specifically, we pass {{FSHLog.getEarliestMemstoreSeqNum() - 1}} as the s= tores flush seqId, which gets saved in the master and used with {{LastSeque= nceId}}. =20 > The WALSplitter skips the cells per store using {{filterCellByStore()}} w= hich is logical. However, the data loss happens, because {{FSHLog.getEarl= iestMemstoreSeqNum()}} ignores flushing edits. For example, if we have seqI= d =3D 10 to 100 as flushing, and seqId 101 in the new memstore, and heartbe= at to master during the flush will carry 101 as the earliest flush seqId. i= f the flush does not complete (due to RS getting killed) this results in in= correctly skipped edits.=20 > The WAL replay code also uses this kind of filtering so this may also exp= lain some of the failures with DLR.=20 -- This message was sent by Atlassian JIRA (v6.3.4#6332)