accumulo-dev mailing list archives

From "Keith Turner (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-334) Bulk random walk test failed
Date Fri, 20 Jan 2012 16:43:47 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189883#comment-13189883 ]

Keith Turner commented on ACCUMULO-334:
---------------------------------------

Ran the test overnight and saw the bug again.  Have not looked into it in detail yet.  Billie helped me find the offending ranges easily w/ the following shell command.  John assisted w/ the not-zero part of the regex.

{noformat}
/bin/accumulo shell -u root -p secret -e 'scan -t bulk_xxx_4807_1327038843548 -np' | egrep -C1 "[^0]$" | grep -C1 " 0"
r17ecb cf:009 []    0
r17ecc cf:000 []    1
--
r18132 cf:009 []    1
r18133 cf:000 []    0
{noformat}

I thought this was cool enough to post.  Yesterday there were two bad ranges; today there is one.  Although yesterday I only looked for -1 values, so there could have been more.  This command looks for any non-zero values adjacent to zero values.
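For anyone who prefers to post-process the scan output offline, here is a rough Python equivalent of the egrep pipeline above.  This is just a sketch, not part of the test; it assumes scan output lines shaped like "<row> <cf:cq> []    <value>", as in the sample output.

```python
# Sketch: find edges of bad ranges in Accumulo shell scan output,
# mimicking:  egrep -C1 "[^0]$" | grep -C1 " 0"
# Assumes lines like "r17ecb cf:009 []    0" (hypothetical helper, not test code).

def find_bad_boundaries(lines):
    """Yield adjacent line pairs where the value flips between zero and
    non-zero, i.e. the edges of a bad range."""
    parsed = [(ln, ln.split()[-1]) for ln in lines if len(ln.split()) >= 4]
    for (prev, pv), (cur, cv) in zip(parsed, parsed[1:]):
        if (pv == "0") != (cv == "0"):  # value flips between 0 and non-0
            yield prev, cur

scan = [
    "r17ecb cf:009 []    0",
    "r17ecc cf:000 []    1",
    "r18132 cf:009 []    1",
    "r18133 cf:000 []    0",
]
for a, b in find_bad_boundaries(scan):
    print(a, "|", b)
```

Run against the sample above, it reports the same two boundaries the shell pipeline finds.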

                
> Bulk random walk test failed
> ----------------------------
>
>                 Key: ACCUMULO-334
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-334
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test
>         Environment: Running random walk test against 1.4.0-SNAPSHOT on 10 node cluster
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.4.0
>
>
> The bulk random walk test failed while running on a 10 node cluster w/ the following error message.
> {noformat}
> 18 23:36:05,167 [bulk.Setup] INFO : Starting bulk test on 459a04a0
> 19 00:24:33,950 [randomwalk.Framework] ERROR: Error during random walk
> java.lang.Exception: Error running node Bulk.xml
>         at org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:253)
>         at org.apache.accumulo.server.test.randomwalk.Framework.run(Framework.java:61)
>         at org.apache.accumulo.server.test.randomwalk.Framework.main(Framework.java:114)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.accumulo.start.Main$1.run(Main.java:89)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.Exception: Error running node bulk.Verify
>         at org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:253)
>         at org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:249)
>         ... 8 more
> Caused by: java.lang.Exception: Bad key at r0d646 cf:000 [] 1326932285943 false -1
>         at org.apache.accumulo.server.test.randomwalk.bulk.Verify.visit(Verify.java:51)
>         at org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:249)
>         ... 9 more
> {noformat}
> Looking at the table, the rows [r0d646, r0edd9] and [r0f056, r10467] all had -1 values.  There was a tablet, 268;r0edd9;r0d645, that overlapped the first range of -1 rows exactly.  This tablet had only the following activity on a tablet server and was then merged out of existence.  The merge operation was 268;r10eff;r093b1.
> {noformat}
> 19 00:05:10,966 [tabletserver.Tablet] DEBUG: Files for low split 268;r0edd9;r0d645  [/b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, /t-0001ale/A0001an3.rf]
> 19 00:05:10,974 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 split 268;r0edd9;r0d645 268;r0f055;r0edd9
> 19 00:05:10,975 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 opened 
> 19 00:05:15,029 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 import /b-0001azi/I0001azm.rf 17138 0
> 19 00:05:15,103 [tabletserver.Tablet] DEBUG: Starting MajC 268;r0edd9;r0d645 [/b-0001azi/I0001azm.rf, /b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, /t-0001ale/A0001an3.rf] --> /t-0001apj/A0001bri.rf_tmp
> 19 00:05:15,339 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 import /b-0001azx/I0001azy.rf 16620 0
> 19 00:05:15,651 [tabletserver.Compactor] DEBUG: Compaction 268;r0edd9;r0d645 181,080 read | 60,360 written | 553,761 entries/sec |  0.327 secs
> 19 00:05:15,661 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 MajC [/b-0001azi/I0001azm.rf, /b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, /t-0001ale/A0001an3.rf] --> /t-0001apj/A0001bri.rf
> 19 00:05:30,672 [tabletserver.Tablet] DEBUG: Starting MajC 268;r0edd9;r0d645 [/b-0001azx/I0001azy.rf] --> /t-0001apj/C0001brn.rf_tmp
> 19 00:05:30,810 [tabletserver.Compactor] DEBUG: Compaction 268;r0edd9;r0d645 60,360 read | 60,360 written | 534,159 entries/sec |  0.113 secs
> 19 00:05:30,824 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 MajC [/b-0001azx/I0001azy.rf] --> /t-0001apj/C0001brn.rf
> 19 00:05:30,943 [tabletserver.Tablet] DEBUG: initiateClose(saveState=true queueMinC=false disableWrites=false) 268;r0edd9;r0d645
> 19 00:05:30,943 [tabletserver.Tablet] DEBUG: completeClose(saveState=true completeClose=true) 268;r0edd9;r0d645
> 19 00:05:30,947 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 closed
> 19 00:05:30,947 [tabletserver.TabletServer] DEBUG: Unassigning 268;r0edd9;r0d645@(null,xxx.xxx.xxx.xxx:9997[134d7425fc59413],null)
> 19 00:05:30,949 [tabletserver.TabletServer] INFO : unloaded 268;r0edd9;r0d645
> 19 00:05:30,949 [tabletserver.TabletServer] INFO : unloaded 268;r0edd9;r0d645
> {noformat}
> For the second range of -1 values [r0f056, r10467], r0f056 corresponds to the split point r0f055.  However, there is no split point corresponding to r10467.  All of the tablets w/ a split of r0f055 lived on one tablet server.
> {noformat}
> 19 00:02:21,262 [tabletserver.Tablet] TABLET_HIST: 268<;r0d645 split 268;r0f055;r0d645 268<;r0f055
> 19 00:02:21,263 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 opened 
> 19 00:02:21,264 [tabletserver.Tablet] TABLET_HIST: 268<;r0f055 opened 
> 19 00:02:44,504 [tabletserver.Tablet] TABLET_HIST: 268<;r0f055 split 268;r11da6;r0f055 268<;r11da6
> 19 00:02:44,505 [tabletserver.Tablet] TABLET_HIST: 268;r11da6;r0f055 opened 
> 19 00:05:10,974 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 split 268;r0edd9;r0d645 268;r0f055;r0edd9
> 19 00:05:10,975 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0edd9 opened 
> 19 00:05:15,023 [tabletserver.Tablet] TABLET_HIST: 268;r11da6;r0f055 split 268;r0f622;r0f055 268;r11da6;r0f622
> 19 00:05:15,024 [tabletserver.Tablet] TABLET_HIST: 268;r0f622;r0f055 opened 
> {noformat}
> All of the tablets mentioned so far were merged away in the same merge operation, making this operation a possible place where data loss occurred.  However, I cannot pinpoint the issue at this point in time.  Below is a little info about the merge from the master logs showing which tablets were involved.
> {noformat}
> 19 00:05:30,616 [master.EventCoordinator] INFO : Merge state of 268;r10eff;r093b1 set to WAITING_FOR_CHOPPED
> 19 00:05:30,677 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc5940c] to chop 268;r09927;r0903a
> 19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc5940c] to chop 268;r0ca9e;r09927
> 19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc5940a] to chop 268;r0d2b5;r0ca9e
> 19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc59412] to chop 268;r0d645;r0d2b5
> 19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0edd9;r0d645
> 19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f055;r0edd9
> 19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f622;r0f055
> 19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f68b;r0f622
> 19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r10c14;r0f68b
> 19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r110f7;r10c14
> {noformat}
> When this test verifies its data and detects data loss, there is no easy way to determine at what time the data loss occurred.  It might be useful to modify the data in the bulk test such that it is easier to determine when data was lost.  For example, the continuous ingest test creates a linked list, making it possible to determine tight time bounds on when a node was ingested.  However, that may change the nature of this test and the bugs that it might find.
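To illustrate the linked-list idea borrowed from the continuous ingest test, here is a hypothetical sketch (none of these names come from the actual randomwalk code): each mutation stores a back-pointer to a previously written row plus a wall-clock timestamp, so a lost row can be bracketed in time by its surviving neighbors.

```python
# Hypothetical sketch: bound the time of data loss using linked-list entries.
# Each entry is (row, prev_row, ts_millis); not actual Accumulo test code.

def build_index(entries):
    # Map each surviving row to its back-pointer and ingest timestamp.
    return {row: (prev, ts) for row, prev, ts in entries}

def loss_time_bound(index, missing_row):
    """If missing_row is absent from the table, any surviving entry that
    points back at it was ingested after it, so the smallest such timestamp
    is an upper bound on when the lost row was written."""
    later = [ts for row, (prev, ts) in index.items() if prev == missing_row]
    return min(later) if later else None

index = build_index([
    ("r0001", None,    1000),
    ("r0002", "r0001", 2000),
    ("r0003", "r0002", 3000),   # links back to r0002
])
print(loss_time_bound(index, "r0002"))  # 3000: r0002 must predate this time
```

With real timestamps this would narrow a lost row down to a window that could be lined up against the tablet server and master logs, as was done by hand above.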

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
