accumulo-notifications mailing list archives

From "William Slacum (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-2261) duplicate locations
Date Thu, 06 Feb 2014 17:16:10 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893550#comment-13893550 ]

William Slacum commented on ACCUMULO-2261:
------------------------------------------

I've seen this twice in the past two days on a 1.5.0 instance on AWS. The first time, I was able to
repair it because it was on a user table, but this time it happened on the !METADATA table.

I'm unable to scan the !METADATA table, so I had to offline it and dig through some of its
RFiles. In the {{root_tablet}} directory, I found:

{noformat}
!!~del/!0/default_tablet/A00006zz.rf : [] 8189 false ->
!!~del/!0/table_info/A00006zx.rf : [] 8186 false ->
!!~del/!0/table_info/F0000704.rf : [] 8187 false ->
!0;!0< srv:dir [] 0 false -> /root_tablet
!0;!0< ~tab:~pr [] 0 false -> ^@
!0;~ file:/table_info/A0000706.rf [] 8188 false -> 4457,270
!0;~ last:144060a192d001b [] 8188 false -> 10.157.33.251:9997
!0;~ loc:144060a192d001b [] 7251 false -> 10.157.33.251:9997
!0;~ loc:244060a193e001d [] 7256 false -> 10.157.42.152:9997
!0;~ srv:compact [] 8188 false -> 673
!0;~ srv:dir [] 0 false -> /table_info
!0;~ srv:flush [] 8185 false -> 674
!0;~ srv:lock [] 8188 false -> tservers/10.157.33.251:9997/zlock-0000000003$144060a192d001b
!0;~ srv:time [] 8185 false -> L3060
!0;~ ~tab:~pr [] 0 false -> ^A!0<
!0< file:/default_tablet/A0000707.rf [] 8190 false -> 310,3
!0< last:144060a192d001b [] 8190 false -> 10.157.33.251:9997
!0< loc:144060a192d001b [] 7244 false -> 10.157.33.251:9997
!0< srv:compact [] 8190 false -> 674
!0< srv:dir [] 0 false -> /default_tablet
!0< srv:flush [] 8184 false -> 674
!0< srv:lock [] 8190 false -> tservers/10.157.33.251:9997/zlock-0000000003$144060a192d001b
!0< srv:time [] 7624 false -> L2284
!0< ~tab:~pr [] 0 false -> ^A~
{noformat}
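Because the !METADATA table could not be scanned, the duplicate had to be spotted by eye in the RFile dump: the {{!0;~}} row carries two {{loc}} entries with different session IDs (144060a192d001b on 10.157.33.251 and 244060a193e001d on 10.157.42.152). A minimal Java sketch that automates that check, assuming the dump lines follow the {{ROW FAMILY:QUALIFIER [VISIBILITY] TIMESTAMP DELETED -> VALUE}} layout seen above (the class and method names are hypothetical, not part of Accumulo):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateLocFinder {

    /**
     * Groups "loc" entries by row from dump lines of the form
     *   ROW FAMILY:QUALIFIER [VISIBILITY] TIMESTAMP DELETED -> VALUE
     * (an assumed layout, based on the excerpt above) and returns only
     * the rows that carry more than one loc entry.
     */
    static Map<String, List<String>> findDuplicateLocations(List<String> dumpLines) {
        Map<String, List<String>> locsByRow = new LinkedHashMap<>();
        for (String line : dumpLines) {
            String[] keyAndValue = line.split(" -> ", 2);
            if (keyAndValue.length != 2) {
                continue; // not a key/value line (e.g. "~del" markers with an empty value)
            }
            String[] keyParts = keyAndValue[0].trim().split("\\s+");
            if (keyParts.length < 2 || !keyParts[1].startsWith("loc:")) {
                continue; // only the current-location column family matters here
            }
            String row = keyParts[0];
            // The loc qualifier is the tserver session, as in the srv:lock value above.
            String session = keyParts[1].substring("loc:".length());
            locsByRow.computeIfAbsent(row, r -> new ArrayList<>())
                     .add(keyAndValue[1].trim() + "[" + session + "]");
        }
        // Keep only rows whose tablet claims more than one location.
        locsByRow.values().removeIf(locs -> locs.size() < 2);
        return locsByRow;
    }

    public static void main(String[] args) {
        List<String> dump = List.of(
            "!0;~ last:144060a192d001b [] 8188 false -> 10.157.33.251:9997",
            "!0;~ loc:144060a192d001b [] 7251 false -> 10.157.33.251:9997",
            "!0;~ loc:244060a193e001d [] 7256 false -> 10.157.42.152:9997",
            "!0< loc:144060a192d001b [] 7244 false -> 10.157.33.251:9997");
        findDuplicateLocations(dump).forEach((row, locs) ->
            System.out.println(row + " has " + locs.size() + " locations: " + locs));
    }
}
```

Run against the excerpt, it flags only the {{!0;~}} row; {{!0<}} has a single {{loc}} entry and is left out.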

Unfortunately I don't have any logs due to a configuration issue. I do believe that both of
those servers were actually alive at the time the issue was noticed. I talked to [~ecn] yesterday
when this was on a user table, and he told me that if both servers are alive, I should delete both
{{loc}} entries. What are the effects of doing that? Are they different when the duplicate entries
are in the root tablet?

> duplicate locations
> -------------------
>
>                 Key: ACCUMULO-2261
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2261
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master, tserver
>    Affects Versions: 1.5.0
>         Environment: hadoop 2.2.0 and zookeeper 3.4.5
>            Reporter: Eric Newton
>            Assignee: Eric Newton
>             Fix For: 1.5.1, 1.6.0
>
>
> Anthony F reports the following:
> bq. I have observed a loss of data when tservers fail during bulk ingest. The keys that are missing are right around the table's splits indicating that data was lost when a tserver died during a split. I am using Accumulo 1.5.0. At around the same time, I observe the master logging a message about "Found two locations for the same extent".
> And:
> bq. I'm currently digging through the logs and will report back. Keep in mind, I'm using Accumulo 1.5.0 on a Hadoop 2.2.0 stack. To determine data loss, I have a 'ConsistencyCheckingIterator' that verifies each row has the expected data (it takes a long time to scan the whole table). Below is a quick summary of what happened. The tablet in question is "d;72~gcm~201304". Notice that it is assigned to 192.168.2.233:9997[343bc1fa155242c] at 2014-01-25 09:49:36,233. At 2014-01-25 09:49:54,141, the tserver goes away. Then, the tablet gets assigned to 192.168.2.223:9997[143bc1f14412432] and shortly after that, I see the BadLocationStateException. The master never recovers from the BLSE - I have to manually delete one of the offending locations.
> {noformat}
> 2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning tablet d;72~gcm~201304;72=192.168.2.233:9997[343bc1fa155242c]
> 2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning tablet p;18~thm~2012101;18=192.168.2.233:9997[343bc1fa155242c]
> 2014-01-25 09:49:54,141 [master.Master] WARN : Lost servers [192.168.2.233:9997[343bc1fa155242c]]
> 2014-01-25 09:49:56,866 [master.Master] DEBUG: 42 assigned to dead servers: [d;03~u36~201302;03~thm~2012091@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;06~u36~2013;06~thm~2012083@(null,192.168.2.233:9997[343bc1fa155242c],null), d;25;24~u36~2013@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;25~u36~201303;25~thm~201209@(null,192.168.2.233:9997[343bc1fa155242c],null), d;27~gcm~2013041;27@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;30~u36~2013031;30~thm~2012082@(null,192.168.2.233:9997[343bc1fa155242c],null), d;34~thm;34~gcm~2013022@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;39~thm~20121;39~gcm~20130418@(null,192.168.2.233:9997[343bc1fa155242c],null), d;41~thm;41~gcm~2013041@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;42~u36~201304;42~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null), d;45~thm~201208;45~gcm~201303@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;48~gcm~2013052;48@(null,192.168.2.233:9997[343bc1fa155242c],null), d;60~u36~2013021;60~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;68~gcm~2013041;68@(null,192.168.2.233:9997[343bc1fa155242c],null), d;72;71~u36~2013@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;72~gcm~201304;72@(192.168.2.233:9997[343bc1fa155242c],null,null), d;75~thm~2012101;75~gcm~2013032@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;78;77~u36~201305@(null,192.168.2.233:9997[343bc1fa155242c],null), d;90~u36~2013032;90~thm~2012092@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;91~thm;91~gcm~201304@(null,192.168.2.233:9997[343bc1fa155242c],null), d;93~u36~2013012;93~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
> m;20;19@(null,192.168.2.233:9997[343bc1fa155242c],null), m;38;37@(null,192.168.2.233:9997[343bc1fa155242c],null),
> m;51;50@(null,192.168.2.233:9997[343bc1fa155242c],null), m;60;59@(null,192.168.2.233:9997[343bc1fa155242c],null),
> m;92;91@(null,192.168.2.233:9997[343bc1fa155242c],null), o;01<@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;04;03@(null,192.168.2.233:9997[343bc1fa155242c],null), o;50;49@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;63;62@(null,192.168.2.233:9997[343bc1fa155242c],null), o;74;73@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;97;96@(null,192.168.2.233:9997[343bc1fa155242c],null), p;08~thm~20121;08@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;09~thm~20121;09@(null,192.168.2.233:9997[343bc1fa155242c],null), p;10;09~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;18~thm~2012101;18@(192.168.2.233:9997[343bc1fa155242c],null,null), p;21;20~thm~201209@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;22~thm~2012091;22@(null,192.168.2.233:9997[343bc1fa155242c],null), p;23;22~thm~2012091@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;41~thm~2012111;41@(null,192.168.2.233:9997[343bc1fa155242c],null), p;42;41~thm~2012111@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;58~thm~201208;58@(null,192.168.2.233:9997[343bc1fa155242c],null)]...
> 2014-01-25 09:49:59,706 [master.Master] DEBUG: Normal Tablets assigning tablet d;72~gcm~201304;72=192.168.2.223:9997[143bc1f14412432]
> 2014-01-25 09:50:13,515 [master.EventCoordinator] INFO : tablet d;72~gcm~201304;72 was loaded on 192.168.2.223:9997
> 2014-01-25 09:51:20,058 [state.MetaDataTableScanner] ERROR: java.lang.RuntimeException: org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException: found two locations for the same extent d;72~gcm~201304: 192.168.2.223:9997[143bc1f14412432] and 192.168.2.233:9997[343bc1fa155242c]
> java.lang.RuntimeException: org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException: found two locations for the same extent d;72~gcm~201304: 192.168.2.223:9997[143bc1f14412432] and 192.168.2.233:9997[343bc1fa155242c]
> {noformat}
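The "assigned to dead servers" line packs each tablet's state into an {{extent@(a,b,c)}} triple. Judging from this log alone, the slots appear to be (future, current, last): the two tablets the master had just assigned at 09:49:36, d;72~gcm~201304;72 and p;18~thm~2012101;18, are the only entries with the dead server in the first slot, while the other 40 have it in the second. That slot order is an inference, not something the report states. A small hypothetical Java parser (class and field names are illustrative, not Accumulo's) pulls those entries out of the long line:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeadServerAssignments {

    /** One "extent@(a,b,c)" entry. ASSUMPTION: the slots are (future, current, last). */
    static final class Entry {
        final String extent;
        final String future;
        final String current;
        final String last;

        Entry(String extent, String future, String current, String last) {
            this.extent = extent;
            this.future = future;
            this.current = current;
            this.last = last;
        }
    }

    // Extent: a run of characters other than whitespace, ',', '@' or '['
    // (so the list's opening '[' is not swallowed). Each slot is either
    // "null" or host:port[session], neither of which contains a comma.
    private static final Pattern ENTRY =
        Pattern.compile("([^\\s,@\\[]+)@\\(([^,]*),([^,]*),([^,)]*)\\)");

    static List<Entry> parse(String logText) {
        List<Entry> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(logText);
        while (m.find()) {
            entries.add(new Entry(m.group(1), slot(m.group(2)), slot(m.group(3)), slot(m.group(4))));
        }
        return entries;
    }

    /** Maps the literal "null" printed in the log to a Java null. */
    private static String slot(String text) {
        return "null".equals(text) ? null : text;
    }

    public static void main(String[] args) {
        String excerpt = "42 assigned to dead servers: ["
            + "d;03~u36~201302;03~thm~2012091@(null,192.168.2.233:9997[343bc1fa155242c],null), "
            + "d;72~gcm~201304;72@(192.168.2.233:9997[343bc1fa155242c],null,null), "
            + "p;18~thm~2012101;18@(192.168.2.233:9997[343bc1fa155242c],null,null)]";
        for (Entry e : parse(excerpt)) {
            // Under the assumed slot order, a non-null first slot means the
            // tablet was still being assigned when its server died.
            if (e.future != null) {
                System.out.println(e.extent + " was still being assigned to " + e.future);
            }
        }
    }
}
```

On this excerpt it reports d;72~gcm~201304;72 and p;18~thm~2012101;18, and d;72 is the extent that later shows up in the BadLocationStateException.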



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
