accumulo-dev mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Failed CI verify (was Re: [VOTE] Apache Accumulo 1.6.1 RC1)
Date Sat, 27 Sep 2014 04:36:32 GMT
Welp, after 8 hrs of memtest86+ with no errors, followed by 4B CI 
(~11hrs) with 2 tservers + random manual `kill -9`'ing (same 
characteristics as the first run), I just had a clean verify.

REFERENCED=4034576211
UNREFERENCED=1000161

I did update to a newer version of 2.6.0-SNAPSHOT and updated to a newer 
kernel (3.16.3 over 3.16.2).

Sometimes I wonder if I really understand computers :)

Josh Elser wrote:
> The crux of it is that both of the errors in the CRC were single-bit
> "variants".
>
> y instead of 9 and p instead of 0
>
> Both of these cases are a '1' in the second most significant bit of the
> byte (0x40) instead of a '0'. We recognized these because y and p are
> outside of the hex range. Fixing both of these fixes the CRC error
> (manually verified).
>
> That's all we know right now. I'm currently running memtest86. I do not
> have ECC ram, so it *is* theoretically possible that was the cause.
> After running memtest for a day or so (or until I need my desktop
> functional again), I'll go back and see if I can reproduce this again.
>
> Mike Drob wrote:
>> Any chance the IRC chats can make it onto the ML for posterity?
>>
>> Mike
>>
>> On Wed, Sep 24, 2014 at 12:04 PM, Keith Turner<keith@deenlo.com> wrote:
>>
>>> On Wed, Sep 24, 2014 at 12:44 PM, Russ Weeks<rweeks@newbrightidea.com>
>>> wrote:
>>>
>>>> Interesting that "y" (0x79) and "9" (0x39) are one bit "away" from each
>>>> other. I blame cosmic rays!
>>>>
>>> It is interesting, and that's only half of the story. It's been
>>> interesting chatting w/ Josh about this on irc and hearing about his
>>> findings.
>>>
>>>
>>>> On Wed, Sep 24, 2014 at 9:05 AM, Josh Elser<josh.elser@gmail.com>
>>> wrote:
>>>>>>> The offending keys are:
>>>>>>>
>>>>>>> 389a85668b6ebf8e 2ff6:4a78 [] 1411499115242
>>>>>>>
>>>>>>> 3a10885b-d481-4d00-be00-0477e231ey65:000000008576b169:
>>>>>>> 0cd98965c9ccc1d0:ba15529e
>>>>>>>
>>>>> The careful eye will notice that the UUID in the first component of the
>>>>> value has a different suffix than the next corrupt key/value (ends with
>>>>> "ey65" instead of "e965"). Fixing this in the Value and re-running the
>>>>> CRC makes it pass.
>>>>>
>>>>>
>>>>> and
>>>>>>> 7e56b58a0c7df128 5fa0:6249 [] 1411499311578
>>>>>>>
>>>>>>> 3a10885b-d481-4d00-be00-0477e231e965:0000p000872d60eb:
>>>>>>> 499fa72752d82a7c:5c5f19e8
>>>>>>>
>>>>>>>
>>
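
The single-bit observation in the thread (y instead of 9, p instead of 0) can be checked with a short sketch. This is only an illustration, not the tooling anyone on the thread used: the helper names (`differing_bits`, `single_bit_repairs`, `find_suspects`) are hypothetical, and the two values are the ones quoted above with the email line wrap joined back together, which is an assumption about how they were originally laid out. It does not recompute the CRC itself.

```python
import string

HEX_DIGITS = set(string.hexdigits)   # 0-9, a-f, A-F
SEPARATORS = set("-:")               # separators seen in the quoted values


def differing_bits(a: str, b: str) -> int:
    """Return the XOR of two characters' code points (set bits = differing bits)."""
    return ord(a) ^ ord(b)


def single_bit_repairs(ch: str) -> list[str]:
    """Hex digits reachable from ch by flipping exactly one of its 8 bits."""
    candidates = []
    for bit in range(8):
        flipped = chr(ord(ch) ^ (1 << bit))
        if flipped in HEX_DIGITS:
            candidates.append(flipped)
    return candidates


def find_suspects(value: str) -> list[tuple[int, str, list[str]]]:
    """Report (index, character, one-bit repair candidates) for non-hex characters."""
    return [
        (i, ch, single_bit_repairs(ch))
        for i, ch in enumerate(value)
        if ch not in HEX_DIGITS and ch not in SEPARATORS
    ]


# Values copied from the thread; joining the wrapped lines is an assumption.
VALUE_1 = ("3a10885b-d481-4d00-be00-0477e231ey65:000000008576b169:"
           "0cd98965c9ccc1d0:ba15529e")
VALUE_2 = ("3a10885b-d481-4d00-be00-0477e231e965:0000p000872d60eb:"
           "499fa72752d82a7c:5c5f19e8")

if __name__ == "__main__":
    print(hex(differing_bits("y", "9")), hex(differing_bits("p", "0")))
    for value in (VALUE_1, VALUE_2):
        for index, ch, repairs in find_suspects(value):
            print(f"index {index}: {ch!r} -> one-bit candidates {repairs}")
```

In both cases the XOR mask is 0x40 and the only hex digit one bit away from the stray character is the expected one, which is consistent with a single flipped bit per value.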
