accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-1831) Write ahead logs from upgrade prematurely GCed
Date Mon, 18 Nov 2013 14:37:20 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825352#comment-13825352 ]

Eric Newton commented on ACCUMULO-1831:
---------------------------------------

This should be governed by {{master.recovery.max.age}}.  That is, we don't really GC recovery
files; we just remove them once they have been sitting there for an hour (by default).  Did
you set this to some very low value?
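
For illustration only, the age-based rule described above can be sketched as follows. This is not Accumulo's actual code: the class and method names are hypothetical, and the one-hour constant simply mirrors the default mentioned in this comment.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch; not taken from the Accumulo source tree.
final class RecoveryAgeCheck {
    // Assumed to mirror the one-hour default of master.recovery.max.age mentioned above.
    static final Duration ASSUMED_DEFAULT_MAX_AGE = Duration.ofHours(1);

    private RecoveryAgeCheck() {}

    // True only when the file has been sitting there strictly longer than maxAge.
    static boolean isOldEnoughToRemove(Instant lastModified, Instant now, Duration maxAge) {
        return Duration.between(lastModified, now).compareTo(maxAge) > 0;
    }
}
```

Under this rule a recovery file that is only a few seconds old is kept unless the maximum age has been configured very low.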


> Write ahead logs from upgrade prematurely GCed
> ----------------------------------------------
>
>                 Key: ACCUMULO-1831
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1831
>             Project: Accumulo
>          Issue Type: Sub-task
>          Components: master, tserver
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>            Priority: Blocker
>             Fix For: 1.6.0
>
>
> I was running {{test/system/upgrade_test.sh dirty}} and the test hung.  Upon inspection,
the WALs from 1.5 were deleted before all tablets were recovered.
> Some tablets from 1.5 recovered fine.
> {noformat}
> 2013-10-29 20:29:26,475 [log.SortedLogRecovery] INFO : Recovery complete for !!R<<
using hdfs://nnhost:6093/rktl/accumulo-upt/recovery/754f171b-c260-42dd-b17e-bd15064608c7
> {noformat}
> Then the GC kicked in and deleted files before tablets were finished recovering.
> {noformat}
> 2013-10-29 20:29:30,421 [gc.GarbageCollectWriteAheadLogs] DEBUG: Removing WAL for offline
server hdfs://nnhost:6093/rktl/accumulo-upt/wal/127.0.0.1+9997/754f171b-c260-42dd-b17e-bd15064608c7
> 2013-10-29 20:29:30,428 [gc.GarbageCollectWriteAheadLogs] DEBUG: Removing sorted WAL
hdfs://nnhost:6093/rktl/accumulo-upt/recovery/754f171b-c260-42dd-b17e-bd15064608c7
> {noformat}
> Tablet failed to recover.
> {noformat}
> 2013-10-29 20:29:30,858 [tabletserver.TabletServer] WARN : exception trying to assign
tablet 1<;row_0000180000 /default_tablet
> java.lang.RuntimeException: java.io.IOException: Unable to find recovery files for extent
1<;row_0000180000 logEntry: 1<; 754f171b-c260-42dd-b17e-bd15064608c7 (19)
>         at org.apache.accumulo.server.tabletserver.Tablet.<init>(Tablet.java:1398)
>         at org.apache.accumulo.server.tabletserver.Tablet.<init>(Tablet.java:1233)
>         at org.apache.accumulo.server.tabletserver.Tablet.<init>(Tablet.java:1088)
>         at org.apache.accumulo.server.tabletserver.Tablet.<init>(Tablet.java:1076)
> {noformat}
> I had set my GC delay to 30 secs while testing another issue, and that's why I ran into
this issue.
> Looking at the code, I do not think it's properly converting relative paths from 1.5 to
absolute paths.  I think the code should convert everything to relative paths (just UUIDs)
to avoid problems caused by differing configurations.
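
A minimal sketch of the "just UUIDs" idea from the description above, for illustration only. It is not the fix that was committed: the class and method names are hypothetical, and it assumes that both the absolute form seen in the log lines and the shorter 1.5-style form end in the WAL's UUID as the last path segment.

```java
import java.util.UUID;

// Hypothetical sketch; not taken from the Accumulo source tree.
final class WalPathKey {
    private WalPathKey() {}

    // Reduces a WAL reference (absolute URI, relative path, or bare UUID) to its UUID.
    static String uuidOf(String walReference) {
        String trimmed = walReference.trim();
        // Ignore trailing slashes so "…/uuid/" and "…/uuid" compare equal.
        while (trimmed.endsWith("/")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1);
        }
        int slash = trimmed.lastIndexOf('/');
        String name = slash >= 0 ? trimmed.substring(slash + 1) : trimmed;
        // Throws IllegalArgumentException if the last segment is not a UUID.
        UUID.fromString(name);
        return name;
    }

    // Two references name the same log when their UUIDs match, whatever the prefix.
    static boolean sameLog(String a, String b) {
        return uuidOf(a).equals(uuidOf(b));
    }
}
```

Comparing on the UUID alone means a log entry written as {{127.0.0.1+9997/754f171b-c260-42dd-b17e-bd15064608c7}} and the full {{hdfs://}} path to the same file are treated as one log, regardless of how the volume prefix is configured.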



--
This message was sent by Atlassian JIRA
(v6.1#6144)
