accumulo-user mailing list archives

From Anthony F <>
Subject Re: data loss around splits when tserver goes down
Date Mon, 27 Jan 2014 14:09:26 GMT
Eric, I'm currently digging through the logs and will report back.  Keep in
mind, I'm using Accumulo 1.5.0 on a Hadoop 2.2.0 stack.  To determine data
loss, I have a 'ConsistencyCheckingIterator' that verifies each row has the
expected data (it takes a long time to scan the whole table).  Below is a
quick summary of what happened.  The tablet in question is
"d;72~gcm~201304".  Notice that it is assigned to[343bc1fa155242c]
at 2014-01-25 09:49:36,233.  At 2014-01-25 09:49:54,141, the tserver goes
away.  Then, the tablet gets assigned to[143bc1f14412432]
and shortly after that, I see the BadLocationStateException.  The master
never recovers from the BLSE - I have to manually delete one of the
offending locations.

2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning
tablet d;72~gcm~201304;72=[343bc1fa155242c]
2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning
tablet p;18~thm~2012101;18=[343bc1fa155242c]
2014-01-25 09:49:54,141 [master.Master] WARN : Lost servers
2014-01-25 09:49:56,866 [master.Master] DEBUG: 42 assigned to dead servers:
m;20;19@(null,[343bc1fa155242c],null), m;38;37@
(null,[343bc1fa155242c],null), m;51;50@
(null,[343bc1fa155242c],null), m;60;59@
(null,[343bc1fa155242c],null), m;92;91@
o;01<@(null,[343bc1fa155242c],null), o;04;03@
(null,[343bc1fa155242c],null), o;50;49@
(null,[343bc1fa155242c],null), o;63;62@
(null,[343bc1fa155242c],null), o;74;73@
(null,[343bc1fa155242c],null), o;97;96@
(null,[343bc1fa155242c],null), p;08~thm~20121;08@
(null,[343bc1fa155242c],null), p;09~thm~20121;09@
(null,[343bc1fa155242c],null), p;10;09~thm~20121@
(null,[343bc1fa155242c],null), p;18~thm~2012101;18@
([343bc1fa155242c],null,null), p;21;20~thm~201209@
(null,[343bc1fa155242c],null), p;22~thm~2012091;22@
(null,[343bc1fa155242c],null), p;23;22~thm~2012091@
(null,[343bc1fa155242c],null), p;41~thm~2012111;41@
(null,[343bc1fa155242c],null), p;42;41~thm~2012111@
(null,[343bc1fa155242c],null), p;58~thm~201208;58@
2014-01-25 09:49:59,706 [master.Master] DEBUG: Normal Tablets assigning
tablet d;72~gcm~201304;72=[143bc1f14412432]
2014-01-25 09:50:13,515 [master.EventCoordinator] INFO : tablet
d;72~gcm~201304;72 was loaded on
2014-01-25 09:51:20,058 [state.MetaDataTableScanner] ERROR:
found two locations for the same extent d;72~gcm~201304:[143bc1f14412432]
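
For anyone wanting to repeat this kind of log check, here is a minimal sketch (not part of Accumulo, and not the tool used above) that lists tablets the master assigned to more than one server session. It assumes the wrapped log lines have been joined back into single lines and that assignment lines look like "assigning tablet <extent>=<server>[<session>]", as in the excerpt. The function names and the regular expression are hypothetical. A reassignment after a tserver dies is normal by itself, so the output only narrows down which extents to inspect in the metadata table.

```python
import re
from collections import defaultdict

# Assumed shape of an (unwrapped) master log line:
#   "... assigning tablet <extent>=<server>[<session>]"
# The server part may be empty, as it is in the excerpt above.
ASSIGN_RE = re.compile(
    r"assigning tablet (?P<extent>\S+)=(?P<server>\S*)\[(?P<session>[0-9a-f]+)\]"
)


def assignments_by_extent(lines):
    """Map each tablet extent to the distinct sessions it was assigned to, in log order."""
    seen = defaultdict(list)
    for line in lines:
        match = ASSIGN_RE.search(line)
        if match is None:
            continue  # not an assignment line
        sessions = seen[match.group("extent")]
        if match.group("session") not in sessions:
            sessions.append(match.group("session"))
    return dict(seen)


def reassigned_extents(lines):
    """Return only the extents that were assigned to more than one session."""
    return {
        extent: sessions
        for extent, sessions in assignments_by_extent(lines).items()
        if len(sessions) > 1
    }


if __name__ == "__main__":
    sample = [
        "2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning tablet d;72~gcm~201304;72=[343bc1fa155242c]",
        "2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning tablet p;18~thm~2012101;18=[343bc1fa155242c]",
        "2014-01-25 09:49:59,706 [master.Master] DEBUG: Normal Tablets assigning tablet d;72~gcm~201304;72=[143bc1f14412432]",
    ]
    for extent, sessions in reassigned_extents(sample).items():
        print(extent, sessions)
```

Any extent it reports can then be compared against the "loc" entries in the metadata table to see whether the old location was actually cleared.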

On Mon, Jan 27, 2014 at 8:53 AM, Eric Newton <> wrote:

> Having two "last" locations... is annoying, and useless.  Having two "loc"
> locations is disastrous.  We do a *lot* of testing that verifies that data
> is not lost, with live ingest and with bulk ingest, and just about every
> other condition you can imagine.  Presently, this testing is being done by
> me for 1.6.0 on Hadoop 2.2.0 and ZK 3.4.5.
> If you can provide any of the following, it would be helpful:
> * an automated test case that demonstrates the problem
> * logs that document what happened
> * a description of the *exact* things you did to detect data loss
> Please don't use the approximate counts displayed on the monitor pages to
> confirm ingest.  These are known to be incorrect with both bulk ingested
> data and right after splits.  The data is there, but the counts are just
> estimates.
> If you find you have verified data loss, please open a ticket, and provide
> as many details as you can, even if it does not happen consistently.
> Thanks!
> -Eric
> On Mon, Jan 27, 2014 at 7:57 AM, Anthony F <> wrote:
>> I took a look in the code . . . the stack trace is not quite the same.
>>  In 1.6.0, the fixed issue related to METADATA_LAST_LOCATION_COLUMN_FAMILY.
>>  The issue I am seeing (in 1.5.0) is related to
>> On Sun, Jan 26, 2014 at 7:00 PM, Anthony F <> wrote:
>>> The stack trace is pretty close and the steps to reproduce match the
>>> scenario in which I observed the issue.  But there's no fix (in Jira)
>>> against 1.5.0, just 1.6.0.
>>> On Sun, Jan 26, 2014 at 5:56 PM, Josh Elser <> wrote:
>>>> Just because the error message is the same doesn't mean that the root
>>>> cause is also the same.
>>>> Without looking more into Eric's changes, I'm not sure if ACCUMULO-2057
>>>> would also affect 1.5.0. We're usually pretty good about checking backwards
>>>> when bugs are found in newer versions, but things slip through the cracks,
>>>> too.
>>>> On 1/26/2014 5:09 PM, Anthony F wrote:
>>>>> This is pretty much the issue:
>>>>> Slightly different error message but it's a different version.  Looks
>>>>> like it's fixed in 1.6.0.  I'll probably need to upgrade.
>>>>> On Sun, Jan 26, 2014 at 4:47 PM, Anthony F <
>>>>> <>> wrote:
>>>>>     Thanks, I'll check Jira.  As for versions, Hadoop 2.2.0, Zk 3.4.5,
>>>>>     CentOS 64bit (kernel 2.6.32-431.el6.x86_64).  Has much testing been
>>>>>     done using Hadoop 2.2.0?  I tried Hadoop 2.0.0 (CDH 4.5.0) but ran
>>>>>     into HDFS-5225/5031 which basically makes it a non-starter.
>>>>>     On Sun, Jan 26, 2014 at 4:29 PM, Josh Elser <
>>>>>     <>> wrote:
>>>>>         I meant to reply to your original email, but I didn't yet,
>>>>>         sorry.
>>>>>         First off, if Accumulo is reporting that it found multiple
>>>>>         locations for the same extent, this is a (very bad) bug in
>>>>>         Accumulo. It might be worth looking at tickets that are marked
>>>>>         as "affects 1.5.0" and "fixed in 1.5.1" on Jira. It's likely
>>>>>         that we've already encountered and fixed the issue, but, if
>>>>>         you can't find a fix that was already made, we don't want to
>>>>>         overlook the potential need for one.
>>>>>         For both "live" and "bulk" ingest, *neither* should lose any
>>>>>         data. This is one thing that Accumulo should never be doing.
>>>>>         If you have multiple locations for an extent, it seems
>>>>>         plausible to me that you would run into data loss. However,
>>>>>         you should focus on trying to determine why you keep running
>>>>>         into multiple locations for a tablet.
>>>>>         After you take a look at Jira, I would likely go ahead and file
>>>>>         a jira to track this since it's easier to follow than an email
>>>>>         thread. Be sure to note if there is anything notable about your
>>>>>         installation (did you download it directly from the <> site?).
>>>>>         You should also include what OS and version and what Hadoop
>>>>>         and ZooKeeper versions you are running.
>>>>>         On 1/26/2014 4:10 PM, Anthony F wrote:
>>>>>             I have observed a loss of data when tservers fail during
>>>>>             bulk ingest.  The keys that are missing are right around
>>>>>             the table's splits, indicating that data was lost when a
>>>>>             tserver died during a split.  I am using Accumulo 1.5.0.
>>>>>             At around the same time, I observe the master logging a
>>>>>             message about "Found two locations for the same extent".
>>>>>             Can anyone shed light on this behavior?  Are tserver
>>>>>             failures during bulk ingest supposed to be fault tolerant?
