hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13853) ITBLL data loss in HBase-1.1
Date Mon, 08 Jun 2015 06:44:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576665#comment-14576665
] 

Hadoop QA commented on HBASE-13853:
-----------------------------------

{color:green}+1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12738293/hbase-13853_v1-branch-1.1.patch
  against branch-1.1 branch at commit 9f43a3bea6a996a93f3c960801a0200b52419b34.
  ATTACHMENT ID: 12738293

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 6 new or modified
tests.

    {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions
(2.4.1 2.5.2 2.6.0)

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of
javac compiler warnings.

    {color:green}+1 protoc{color}.  The applied patch does not increase the total number of
protoc compiler warnings.

    {color:green}+1 javadoc{color}.  The javadoc tool did not generate any warning messages.

    {color:green}+1 checkstyle{color}.  The applied patch does not increase the total number
of checkstyle errors

    {color:green}+1 findbugs{color}.  The patch does not introduce any  new Findbugs (version
2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number
of release audit warnings.

    {color:green}+1 lineLengths{color}.  The patch does not introduce lines longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

    {color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14322//testReport/
Release Findbugs (version 2.0.3) 	warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14322//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14322//artifact/patchprocess/checkstyle-aggregate.html

  Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14322//console

This message is automatically generated.

> ITBLL data loss in HBase-1.1
> ----------------------------
>
>                 Key: HBASE-13853
>                 URL: https://issues.apache.org/jira/browse/HBASE-13853
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>            Priority: Blocker
>             Fix For: 2.0.0, 1.1.1
>
>         Attachments: hbase-13853_v1-branch-1.1.patch
>
>
> After 2 days of log mining with ITBLL Search tool (thanks Stack, BTW for the tool), the
sequence of events are like this: 
> One of mappers from Search reportedly finds missing keys in some WAL file:
> {code}
> 2015-06-07 06:49:24,573 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split:
hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-4.openstacklocal%2C16020%2C1433636320201.default.1433637872643
(-9223372036854775808:9223372036854775807) length:132679657 2015-06-07 06:49:24,639 INFO [main]
org.apache.hadoop.hbase.mapreduce.WALInputFormat: Opening reader for hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-4.openstacklocal%2C16020%2C1433636320201.default.1433637872643
(-9223372036854775808:9223372036854775807) length:132679657
> 2015-06-07 06:50:16,383 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search:
Loaded keys to find: count=2384870 2015-06-07 06:50:16,607 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search:
Found cell=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC/meta:prev/1433637873293/Put/vlen=16/seqid=0
2015-06-07 06:50:16,663 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search:
Found cell=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC/meta:count/1433637873293/Put/vlen=8/seqid=0
> 2015-06-07 06:50:16,664 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search:
Found cell=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC/meta:client/1433637873293/Put/vlen=71/seqid=0
2015-06-07 06:50:16,671 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search:
Found cell=$\x1A\x99\x06\x86\xE7\x07\xA2\xA7\xB2\xFB\xCEP\x12"\x04/meta:prev/1433637873293/Put/vlen=16/seqid=0
> 2015-06-07 06:50:16,672 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search:
Found cell=$\x1A\x99\x06\x86\xE7\x07\xA2\xA7\xB2\xFB\xCEP\x12"\x04/meta:count/1433637873293/Put/vlen=8/seqid=0
2015-06-07 06:50:16,678 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search:
Found cell=$\x1A\x99\x06\x86\xE7\x07\xA2\xA7\xB2\xFB\xCEP\x12"\x04/meta:client/1433637873293/Put/vlen=71/seqid=0
> 2015-06-07 06:50:16,679 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search:
Found cell=\x1F\x80#\x95\xAE:i=)S\x01\x08 \xD6\xFF\xD5/meta:prev/1433637873293/Put/vlen=16/seqid=0
> {code}
> {{hlog -p}} confirms the missing keys are there in the file: 
> {code}
> Sequence=7276 , region=b086e29f909c9790446cd457c1ea7674 at write timestamp=Sun Jun 07
00:44:33 UTC 2015
> row=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=meta:prev
>     value: \x1B\xF5\xF3^\x8D\xB4\x85\xE3\xF4wS\x9A]\x0D\xABe
> row=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=meta:count
>     value: \x00\x00\x00\x00\x002\x87l
> row=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=meta:client
>     value: Job: job_1433466891829_0002 Task: attempt_1433466891829_0002_m_000003_0
> {code}
> When the RS gets killing from CM, master does WAL splitting: 
> {code}
> ./hbase-hbase-regionserver-os-enis-dal-test-jun-4-2.log:2015-06-07 00:46:12,581 INFO
 [RS_LOG_REPLAY_OPS-os-enis-dal-test-jun-4-2:16020-0] wal.WALSplitter: Processed 2971 edits
across 4 regions; edits skipped=740; log file=hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-4.openstacklocal,16020,1433636320201-splitting/os-enis-dal-test-jun-4-4.openstacklocal%2C16020%2C1433636320201.default.1433637872643,
length=132679657, corrupted=false, progress failed=false
> {code}
> The edits with Sequence=7276 should be in this recovered.edits file: 
> {code}
> 2015-06-07 00:46:12,574 INFO  [split-log-closeStream-2] wal.WALSplitter: Closed wap hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674/recovered.edits/0000000000000007276.temp
(wrote 739 edits in 3950ms)
> 2015-06-07 00:46:12,580 INFO  [split-log-closeStream-2] wal.WALSplitter: Rename hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674/recovered.edits/0000000000000007276.temp
to hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674/recovered.edits/0000000000000008752
> {code}
> Notice that the file {{0000000000000008752}} should contain {{(wrote 739 edits in 3950ms)}}.

> When the region gets opened again, it replays the recovered.edits files:
> {code}
> 2015-06-07 00:46:17,764 DEBUG [RS_OPEN_REGION-os-enis-dal-test-jun-4-5:16020-0] regionserver.HRegion:
Found 4 recovered edits file(s) under hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674
> {code}
> However, it does not see ANY entry in the recovered.edits file  {{0000000000000008752}}:

> {code}
> 2015-06-07 00:46:17,796 DEBUG [RS_OPEN_REGION-os-enis-dal-test-jun-4-5:16020-0] regionserver.HRegion:
Applied 0, skipped 0, firstSequenceIdInLog=-1, maxSequenceIdInLog=-1, path=hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674/recovered.edits/0000000000000008752
> {code}
> Looking at the length of the file, it is very small which means that the edits are not
written. 
> Now, If you are still with me, I think the reason why we do not write any edit to {{0000000000000008752}}
although we say we have written 739 edits is due to HBASE-12405. 
> Specifically, we pass {{FSHLog.getEarliestMemstoreSeqNum() - 1}} as the stores flush
seqId, which gets saved in the master and used with {{LastSequenceId}}.  
> The WALSplitter skips the cells per store using {{filterCellByStore()}} which is logical.
However, the data loss happens, because   {{FSHLog.getEarliestMemstoreSeqNum()}} ignores flushing
edits. For example, if we have seqId = 10 to 100 as flushing, and seqId 101 in the new memstore,
and heartbeat to master during the flush will carry 101 as the earliest flush seqId. if the
flush does not complete (due to RS getting killed) this results in incorrectly skipped edits.

> The WAL replay code also uses this kind of filtering so this may also explain some of
the failures with DLR. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message