accumulo-user mailing list archives

From Anthony F <afc...@gmail.com>
Subject Re: data loss around splits when tserver goes down
Date Mon, 27 Jan 2014 12:57:00 GMT
I took a look in the code . . . the stack trace is not quite the same.  In
1.6.0, the issue that was fixed relates to
METADATA_LAST_LOCATION_COLUMN_FAMILY.  The issue I am seeing (in 1.5.0) is
related to METADATA_CURRENT_LOCATION_COLUMN_FAMILY (line 144).
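For anyone else chasing this, one way to check whether any extent currently
has more than one location entry is to scan the metadata table directly.
Rough sketch below against the 1.5.0 client API (it assumes the 1.5 layout,
i.e. the !METADATA table and the "loc" column family; the instance name,
ZooKeeper hosts, and credentials are placeholders):

import java.util.HashMap;
import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class FindDuplicateTabletLocations {
  public static void main(String[] args) throws Exception {
    // Placeholders: instance name, ZooKeeper quorum, credentials.
    Connector conn = new ZooKeeperInstance("instance", "zkhost:2181")
        .getConnector("root", new PasswordToken("secret"));

    // 1.5.0: current tablet assignments live in the !METADATA table
    // under the "loc" column family (one entry per assignment).
    Scanner scanner = conn.createScanner("!METADATA", Authorizations.EMPTY);
    scanner.fetchColumnFamily(new Text("loc"));

    Map<Text,Integer> locCounts = new HashMap<Text,Integer>();
    for (Map.Entry<Key,Value> entry : scanner) {
      Text extent = entry.getKey().getRow();  // row = tableId;endRow
      Integer seen = locCounts.get(extent);
      int count = (seen == null) ? 1 : seen + 1;
      locCounts.put(extent, count);
      if (count > 1) {
        System.out.println("multiple locations for extent " + extent + " -> "
            + entry.getKey().getColumnQualifier() + "=" + entry.getValue());
      }
    }
  }
}

As far as I can tell, a healthy tablet has exactly one "loc" entry on its
row, so more than one per row lines up with the "Found two locations for the
same extent" message from the master.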


On Sun, Jan 26, 2014 at 7:00 PM, Anthony F <afccri@gmail.com> wrote:

> The stack trace is pretty close and the steps to reproduce match the
> scenario in which I observed the issue.  But there's no fix (in Jira)
> against 1.5.0, just 1.6.0.
>
>
> On Sun, Jan 26, 2014 at 5:56 PM, Josh Elser <josh.elser@gmail.com> wrote:
>
>> Just because the error message is the same doesn't mean that the root
>> cause is also the same.
>>
>> Without looking more into Eric's changes, I'm not sure if ACCUMULO-2057
>> would also affect 1.5.0. We're usually pretty good about checking backwards
>> when bugs are found in newer versions, but things slip through the cracks,
>> too.
>>
>>
>> On 1/26/2014 5:09 PM, Anthony F wrote:
>>
>>> This is pretty much the issue:
>>>
>>> https://issues.apache.org/jira/browse/ACCUMULO-2057
>>>
>>> The error message is slightly different, but it's a different version.
>>> Looks like it's fixed in 1.6.0.  I'll probably need to upgrade.
>>>
>>>
>>> On Sun, Jan 26, 2014 at 4:47 PM, Anthony F <afccri@gmail.com> wrote:
>>>
>>>     Thanks, I'll check Jira.  As for versions: Hadoop 2.2.0, ZooKeeper
>>>     3.4.5, CentOS 64-bit (kernel 2.6.32-431.el6.x86_64).  Has much
>>>     testing been done using Hadoop 2.2.0?  I tried Hadoop 2.0.0 (CDH
>>>     4.5.0) but ran into HDFS-5225/HDFS-5031, which basically make it a
>>>     non-starter.
>>>
>>>
>>>     On Sun, Jan 26, 2014 at 4:29 PM, Josh Elser <josh.elser@gmail.com> wrote:
>>>
>>>         I meant to reply to your original email but haven't yet, sorry.
>>>
>>>         First off, if Accumulo is reporting that it found multiple
>>>         locations for the same extent, this is a (very bad) bug in
>>>         Accumulo. It might be worth looking at tickets that are marked as
>>>         "affects 1.5.0" and "fixed in 1.5.1" on Jira. It's likely that
>>>         we've already encountered and fixed the issue, but if you can't
>>>         find a fix that was already made, we don't want to overlook the
>>>         potential need for one.
>>>
>>>         Neither "live" nor "bulk" ingest should lose any data; losing
>>>         data is one thing Accumulo should never do. If you have multiple
>>>         locations for an extent, it seems plausible to me that you would
>>>         run into data loss. However, you should focus on trying to
>>>         determine why you keep running into multiple locations for a
>>>         tablet.
>>>
>>>         After you take a look at Jira, I would go ahead and file a Jira
>>>         to track this, since it's easier to follow than an email thread.
>>>         Be sure to note if there is anything notable about your
>>>         installation (did you download it directly from the
>>>         accumulo.apache.org site?). You should also include what OS and
>>>         version you are on, and what Hadoop and ZooKeeper versions you
>>>         are running.
>>>
>>>
>>>         On 1/26/2014 4:10 PM, Anthony F wrote:
>>>
>>>             I have observed a loss of data when tservers fail during
>>>             bulk ingest.  The keys that are missing are right around the
>>>             table's splits, indicating that data was lost when a tserver
>>>             died during a split.  I am using Accumulo 1.5.0.  At around
>>>             the same time, I observe the master logging a message about
>>>             "Found two locations for the same extent".  Can anyone shed
>>>             light on this behavior?  Is bulk ingest supposed to be
>>>             tolerant of tserver failures?
>>>
>>>
>>>
>>>
>
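For reference on the ingest path discussed above: bulk loading in 1.5
presumably goes through TableOperations#importDirectory (or the shell's
importdirectory command).  A minimal sketch of the call, with the instance
name, ZooKeeper hosts, credentials, table name, and HDFS paths as
placeholders:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class BulkImportSketch {
  public static void main(String[] args) throws Exception {
    // Placeholders: instance name, ZooKeeper quorum, credentials.
    Connector conn = new ZooKeeperInstance("instance", "zkhost:2181")
        .getConnector("root", new PasswordToken("secret"));

    // Files that cannot be assigned are moved to the failures directory;
    // anything left there after the call is data that was not imported.
    conn.tableOperations().importDirectory("mytable",
        "/bulk/files", "/bulk/failures", false);
  }
}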
