accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-2261) duplicate locations
Date Tue, 28 Jan 2014 01:04:37 GMT


Eric Newton commented on ACCUMULO-2261:

[~anthonyf] are you sure that the other processes on your nodes wouldn't push a tserver into
swap?  If you are really this unlucky, I want you to help test all future releases.

I've written a fix using conditional mutations, but I'm going to leave it for 1.7.  Conditional
mutations are new, the METADATA table is a tricky beast, and we still need a fix for the
root table pointer.  Instead, I'll have the master repair the metadata when it finds it is
relatively safe to do so: if it sees multiple locations for a tablet, and one of them is a
dead tserver, it will remove that entry.  I'll back-port the fix to 1.5, too.
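To make that concrete, the repair amounts to something like the sketch below (a minimal
sketch against the 1.5 client API, not the actual patch; the Connector and the set of dead
tserver session ids are assumed to come from the master's existing bookkeeping):

{noformat}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class DuplicateLocationRepair {
  // Remove the loc entry of any tablet that has two locations when one of
  // the holders is a dead tserver session -- the only case that is clearly
  // safe to repair automatically.
  static void repair(Connector conn, Set<String> deadSessions) throws Exception {
    Scanner scanner = conn.createScanner("!METADATA", Authorizations.EMPTY);
    scanner.fetchColumnFamily(new Text("loc"));

    // Group loc entries by tablet (row = extent, qualifier = session id).
    Map<String,List<Key>> locs = new HashMap<String,List<Key>>();
    for (Entry<Key,Value> e : scanner) {
      String extent = e.getKey().getRow().toString();
      List<Key> keys = locs.get(extent);
      if (keys == null)
        locs.put(extent, keys = new ArrayList<Key>());
      keys.add(new Key(e.getKey()));
    }

    BatchWriter bw = conn.createBatchWriter("!METADATA", new BatchWriterConfig());
    for (List<Key> tabletLocs : locs.values()) {
      if (tabletLocs.size() < 2)
        continue; // a single location is the normal, healthy state
      for (Key k : tabletLocs) {
        if (deadSessions.contains(k.getColumnQualifier().toString())) {
          // That holder can no longer serve the tablet: drop its entry.
          Mutation m = new Mutation(k.getRow());
          m.putDelete(k.getColumnFamily(), k.getColumnQualifier());
          bw.addMutation(m);
        }
      }
    }
    bw.close();
  }
}
{noformat}

The only automatic action taken is deleting a loc entry whose holder is known to be dead;
any other combination of duplicate locations gets left for a human to look at.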

I'm decreasing the severity because there is a known work-around: removing the stale entry
from the !METADATA table by hand, as sketched below.  Feel free to raise it back up if you
disagree.
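For anyone who hits this before the fix lands, the by-hand removal looks roughly like the
following shell session (instance name, user, and tserver addresses are made up; the loc
column qualifier is the tserver session id, so delete the entry that belongs to the dead
server):

{noformat}
root@inst> table !METADATA
root@inst !METADATA> scan -b d;72~gcm~201304 -e d;72~gcm~201304 -c loc
d;72~gcm~201304 loc:143bc1f14412432 []    tserver1.example.com:9997
d;72~gcm~201304 loc:343bc1fa155242c []    tserver2.example.com:9997
root@inst !METADATA> delete d;72~gcm~201304 loc 343bc1fa155242c
{noformat}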

> duplicate locations
> -------------------
>                 Key: ACCUMULO-2261
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master, tserver
>    Affects Versions: 1.5.0
>         Environment: hadoop 2.2.0 and zookeeper 3.4.5
>            Reporter: Eric Newton
>            Assignee: Eric Newton
>             Fix For: 1.5.1
> Anthony F reports the following:
> bq. I have observed a loss of data when tservers fail during bulk ingest.  The keys that
are missing are right around the table's splits, indicating that data was lost when a tserver
died during a split.  I am using Accumulo 1.5.0.  At around the same time, I observe the master
logging a message about "Found two locations for the same extent".
> And:
> bq.  I'm currently digging through the logs and will report back.  Keep in mind, I'm
using Accumulo 1.5.0 on a Hadoop 2.2.0 stack.  To determine data loss, I have a 'ConsistencyCheckingIterator'
that verifies each row has the expected data (it takes a long time to scan the whole table).
 Below is a quick summary of what happened.  The tablet in question is "d;72~gcm~201304".
 Notice that it is assigned to [343bc1fa155242c] at 2014-01-25 09:49:36,233.  At 2014-01-25
09:49:54,141, the tserver goes away.  Then the tablet gets assigned to [143bc1f14412432],
and shortly after that I see the BadLocationStateException.  The master never recovers from
the BLSE - I have to manually delete one of the offending locations.
> {noformat}
> 2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning tablet d;72~gcm~201304;72=[343bc1fa155242c]
> 2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning tablet p;18~thm~2012101;18=[343bc1fa155242c]
> 2014-01-25 09:49:54,141 [master.Master] WARN : Lost servers [[343bc1fa155242c]]
> 2014-01-25 09:49:56,866 [master.Master] DEBUG: 42 assigned to dead servers: [d;03~u36~201302;03~thm~2012091@(null,[343bc1fa155242c],null),
d;06~u36~2013;06~thm~2012083@(null,[343bc1fa155242c],null), d;25;24~u36~2013@(null,[343bc1fa155242c],null),
d;25~u36~201303;25~thm~201209@(null,[343bc1fa155242c],null), d;27~gcm~2013041;27@(null,[343bc1fa155242c],null),
d;30~u36~2013031;30~thm~2012082@(null,[343bc1fa155242c],null), d;34~thm;34~gcm~2013022@(null,[343bc1fa155242c],null),
d;39~thm~20121;39~gcm~20130418@(null,[343bc1fa155242c],null), d;41~thm;41~gcm~2013041@(null,[343bc1fa155242c],null),
d;42~u36~201304;42~thm~20121@(null,[343bc1fa155242c],null), d;45~thm~201208;45~gcm~201303@(null,[343bc1fa155242c],null),
d;48~gcm~2013052;48@(null,[343bc1fa155242c],null), d;60~u36~2013021;60~thm~20121@(null,[343bc1fa155242c],null),
d;68~gcm~2013041;68@(null,[343bc1fa155242c],null), d;72;71~u36~2013@(null,[343bc1fa155242c],null),
d;72~gcm~201304;72@([343bc1fa155242c],null,null), d;75~thm~2012101;75~gcm~2013032@(null,[343bc1fa155242c],null),
d;78;77~u36~201305@(null,[343bc1fa155242c],null), d;90~u36~2013032;90~thm~2012092@(null,[343bc1fa155242c],null),
d;91~thm;91~gcm~201304@(null,[343bc1fa155242c],null), d;93~u36~2013012;93~thm~20121@(null,[343bc1fa155242c],null),
m;20;19@(null,[343bc1fa155242c],null), m;38;37@(null,[343bc1fa155242c],null),
m;51;50@(null,[343bc1fa155242c],null), m;60;59@(null,[343bc1fa155242c],null),
m;92;91@(null,[343bc1fa155242c],null), o;01<@(null,[343bc1fa155242c],null),
o;04;03@(null,[343bc1fa155242c],null), o;50;49@(null,[343bc1fa155242c],null),
o;63;62@(null,[343bc1fa155242c],null), o;74;73@(null,[343bc1fa155242c],null),
o;97;96@(null,[343bc1fa155242c],null), p;08~thm~20121;08@(null,[343bc1fa155242c],null),
p;09~thm~20121;09@(null,[343bc1fa155242c],null), p;10;09~thm~20121@(null,[343bc1fa155242c],null),
p;18~thm~2012101;18@([343bc1fa155242c],null,null), p;21;20~thm~201209@(null,[343bc1fa155242c],null),
p;22~thm~2012091;22@(null,[343bc1fa155242c],null), p;23;22~thm~2012091@(null,[343bc1fa155242c],null),
p;41~thm~2012111;41@(null,[343bc1fa155242c],null), p;42;41~thm~2012111@(null,[343bc1fa155242c],null),
> 2014-01-25 09:49:59,706 [master.Master] DEBUG: Normal Tablets assigning tablet d;72~gcm~201304;72=[143bc1f14412432]
> 2014-01-25 09:50:13,515 [master.EventCoordinator] INFO : tablet d;72~gcm~201304;72 was loaded on
> 2014-01-25 09:51:20,058 [state.MetaDataTableScanner] ERROR: java.lang.RuntimeException: org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException: found two locations for the same extent d;72~gcm~201304:[143bc1f14412432] and [343bc1fa155242c]
> java.lang.RuntimeException: org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException: found two locations for the same extent d;72~gcm~201304:[143bc1f14412432]
> {noformat}

This message was sent by Atlassian JIRA
