accumulo-user mailing list archives

From Eric Newton <eric.new...@gmail.com>
Subject Re: data loss around splits when tserver goes down
Date Mon, 27 Jan 2014 13:53:23 GMT
Having two "last" locations... is annoying, and useless.  Having two "loc"
locations is disastrous.  We do a *lot* of testing that verifies that data
is not lost, with live ingest and with bulk ingest, and just about every
other condition you can imagine.  Presently, this testing is being done by
me for 1.6.0 on Hadoop 2.2.0 and ZK 3.4.5.
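
One way to check for the "two locations" condition described above is to group metadata entries by tablet row and flag any row with more than one entry in the "loc" column family. The following is only a rough sketch: it assumes the metadata entries have already been read by a client and flattened into (row, family, qualifier, value) tuples, and the function name and sample data are made up for illustration.

```python
from collections import defaultdict


def duplicate_locations(entries, family="loc"):
    """Return {metadata_row: [values]} for rows with more than one entry in `family`.

    `entries` is an iterable of (row, family, qualifier, value) tuples,
    assumed to have been read from the metadata table beforehand.
    """
    by_row = defaultdict(list)
    for row, fam, _qualifier, value in entries:
        if fam == family:
            by_row[row].append(value)
    # Keep only the rows where the same tablet has several locations.
    return {row: locs for row, locs in by_row.items() if len(locs) > 1}


# Hypothetical sample data: tablet "2;m" has two "loc" entries.
sample = [
    ("2;m", "loc", "session-a", "host1:9997"),
    ("2;m", "loc", "session-b", "host2:9997"),
    ("2;m", "last", "session-a", "host1:9997"),
    ("2<", "loc", "session-c", "host3:9997"),
]
dupes = duplicate_locations(sample)
```

Passing family="last" performs the same check for the less serious duplicate "last" case.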

If you can provide any of the following, it would be helpful:

* an automated test case that demonstrates the problem
* logs that document what happened
* a description of the *exact* things you did to detect data loss

Please don't use the approximate counts displayed on the monitor pages to
confirm ingest.  These are known to be inaccurate both for bulk-ingested
data and right after splits.  The data is there, but the counts are just
estimates.
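
An exact check, as opposed to the monitor estimates, could compare the row IDs that were supposed to be ingested against the row IDs returned by a full scan of the table, then group the missing rows by tablet to see whether they cluster around splits. This is a minimal sketch under stated assumptions: the expected and scanned row IDs and the table's split points are assumed to be collected separately by a client, and all function names here are hypothetical.

```python
from bisect import bisect_left


def find_missing(expected_rows, scanned_rows):
    """Return the expected row IDs that never appeared in the scan, sorted."""
    return sorted(set(expected_rows) - set(scanned_rows))


def tablet_index(row, splits):
    """Index of the tablet holding `row`, given the sorted split points.

    A tablet's end row is inclusive, so a row equal to a split point
    belongs to the tablet that ends at that split.
    """
    return bisect_left(splits, row)


def missing_by_tablet(expected_rows, scanned_rows, splits):
    """Group the missing row IDs by the index of the tablet they fall in."""
    report = {}
    for row in find_missing(expected_rows, scanned_rows):
        report.setdefault(tablet_index(row, splits), []).append(row)
    return report


# Hypothetical data: ten rows expected, two of them absent from the scan.
expected = ["row%04d" % i for i in range(10)]
scanned = [r for r in expected if r not in ("row0003", "row0004")]
splits = ["row0003", "row0006"]
report = missing_by_tablet(expected, scanned, splits)
```

An empty report means every expected row was found. A non-empty one lists the exact missing keys per tablet, which is the kind of detail that is useful in a ticket.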

If you find you have verified data loss, please open a ticket, and provide
as many details as you can, even if it does not happen consistently.

Thanks!

-Eric



On Mon, Jan 27, 2014 at 7:57 AM, Anthony F <afccri@gmail.com> wrote:

> I took a look in the code . . . the stack trace is not quite the same.  In
> 1.6.0, the fixed issue related to METADATA_LAST_LOCATION_COLUMN_FAMILY.
>  The issue I am seeing (in 1.5.0) is related to
> METADATA_CURRENT_LOCATION_COLUMN_FAMILY (line 144).
>
>
> On Sun, Jan 26, 2014 at 7:00 PM, Anthony F <afccri@gmail.com> wrote:
>
>> The stack trace is pretty close and the steps to reproduce match the
>> scenario in which I observed the issue.  But there's no fix (in Jira)
>> against 1.5.0, just 1.6.0.
>>
>>
>> On Sun, Jan 26, 2014 at 5:56 PM, Josh Elser <josh.elser@gmail.com> wrote:
>>
>>> Just because the error message is the same doesn't mean that the root
>>> cause is also the same.
>>>
>>> Without looking more into Eric's changes, I'm not sure if ACCUMULO-2057
>>> would also affect 1.5.0. We're usually pretty good about checking backwards
>>> when bugs are found in newer versions, but things slip through the cracks,
>>> too.
>>>
>>>
>>> On 1/26/2014 5:09 PM, Anthony F wrote:
>>>
>>>> This is pretty much the issue:
>>>>
>>>> https://issues.apache.org/jira/browse/ACCUMULO-2057
>>>>
>>>> Slightly different error message but it's a different version.  Looks
>>>> like it's fixed in 1.6.0.  I'll probably need to upgrade.
>>>>
>>>>
>>>> On Sun, Jan 26, 2014 at 4:47 PM, Anthony F <afccri@gmail.com> wrote:
>>>>
>>>>     Thanks, I'll check Jira.  As for versions, Hadoop 2.2.0, Zk 3.4.5,
>>>>     CentOS 64bit (kernel 2.6.32-431.el6.x86_64).  Has much testing been
>>>>     done using Hadoop 2.2.0?  I tried Hadoop 2.0.0 (CDH 4.5.0) but ran
>>>>     into HDFS-5225/5031 which basically makes it a non-starter.
>>>>
>>>>
>>>>     On Sun, Jan 26, 2014 at 4:29 PM, Josh Elser <josh.elser@gmail.com>
>>>>     wrote:
>>>>
>>>>         I meant to reply to your original email, but I didn't yet,
>>>>         sorry.
>>>>
>>>>         First off, if Accumulo is reporting that it found multiple
>>>>         locations for the same extent, this is a (very bad) bug in
>>>>         Accumulo. It might be worth looking at tickets that are marked as
>>>>         "affects 1.5.0" and "fixed in 1.5.1" on Jira. It's likely that
>>>>         we've already encountered and fixed the issue, but, if you can't
>>>>         find a fix that was already made, we don't want to overlook the
>>>>         potential need for one.
>>>>
>>>>         For both "live" and "bulk" ingest, *neither* should lose any
>>>>         data. This is one thing that Accumulo should never be doing. If
>>>>         you have multiple locations for an extent, it seems plausible to
>>>>         me that you would run into data loss. However, you should focus
>>>>         on trying to determine why you keep running into multiple
>>>>         locations for a tablet.
>>>>
>>>>         After you take a look at Jira, I would likely go ahead and file
>>>>         a jira to track this since it's easier to follow than an email
>>>>         thread. Be sure to note if there is anything notable about your
>>>>         installation (did you download it directly from the
>>>>         accumulo.apache.org <http://accumulo.apache.org> site?). You
>>>>         should also include what OS and version and what Hadoop and
>>>>         ZooKeeper versions you are running.
>>>>
>>>>
>>>>         On 1/26/2014 4:10 PM, Anthony F wrote:
>>>>
>>>>             I have observed a loss of data when tservers fail during
>>>>             bulk ingest.  The keys that are missing are right around the
>>>>             table's splits indicating that data was lost when a tserver
>>>>             died during a split.  I am using Accumulo 1.5.0.  At around
>>>>             the same time, I observe the master logging a message about
>>>>             "Found two locations for the same extent".  Can anyone shed
>>>>             light on this behavior?  Are tserver failures during bulk
>>>>             ingest supposed to be fault tolerant?
>>>>
>>>>
>>>>
>>>>
>>
>
