kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: [KUDU Tablet]unrecoverable crash
Date Sat, 20 Feb 2016 01:28:07 GMT
If you have a replicated cluster, it's likely that the master already
re-replicated non-corrupt versions of these tablets to other machines.

It would still be great if you can send one of the WAL directories to
me off-list so I can take a look and try to understand what's going
on.

Thanks
-Todd

On Fri, Feb 19, 2016 at 5:05 PM, Nick Wolf <nickwolf7@gmail.com> wrote:
> I've identified the tablet ID and tried deleting it and starting the server,
> but it seems like a chain reaction. The failures keep coming one by one with
> different tablet IDs.
> rm tablet-meta/1c2475126c7c4cc2b82f95bd6af5bdb4
> rm wals/1c2475126c7c4cc2b82f95bd6af5bdb4
> rm consensus-meta/1c2475126c7c4cc2b82f95bd6af5bdb4
>
> A notable point here is that none of the tablet IDs that it shows
> bootstrapping appear in the web interface (http://host:8051/tables).
>
> On Fri, Feb 19, 2016 at 12:49 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>
>> Hi Nick,
>>
>> Are you able to determine the tablet ID that is failing to restart?
>> The log line indicates that it's thread ID 6285. If you look farther
>> up the log with 'grep " 6285 " kudu-tserver.INFO', you should see a
>> log message indicating that that thread is starting to bootstrap a
>> particular tablet.
>>
>> Is this a replicated table, or num_replicas=1? If it's replicated, we
>> can probably recover by removing the corrupt replica and letting it
>> grab a new copy from one of the other replicas. Otherwise, we'll have
>> to do some more serious "surgery" which we can assist you with.
>>
>> Either way, see if you can figure out the bad tablet ID. Then, if it's
>> possible to send a copy of the WAL directory for this tablet to me off
>> list, I can try to do some post-mortem analysis to see what went
>> wrong.
>>
>> Thanks
>> -Todd
>>
>> On Fri, Feb 19, 2016 at 12:37 PM, Nick Wolf <nickwolf7@gmail.com> wrote:
>> > Kudu tablet crashed with the following fatal error.
>> >
>> > F0219 12:15:11.389806  6285 mvcc.cc:542] Check failed: _s.ok() Bad status:
>> > Illegal state: Timestamp: 5963266013874102274 is already committed. Current
>> > Snapshot: MvccSnapshot[committed={T|T < 5963266013874118554 or (T in
>> > {5963266013874118554})}]
>> >
>> > It throws the same fatal error and crashes immediately no matter how many
>> > times I try to restart the service.
>> >
>> > Any ideas to get out of this situation? I don't want to lose the data.
>> >
>> >
>> > --Nick
>> >
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>
>



-- 
Todd Lipcon
Software Engineer, Cloudera
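
The grep step suggested in the quoted message (grep " 6285 " kudu-tserver.INFO) matches the thread ID anywhere in a line. A stricter variant can be sketched in Python, assuming the log uses the usual glog prefix seen in the crash line (severity letter, mmdd, time, thread ID, file:line]). The function name lines_for_thread and the regular expression are illustrative assumptions, not part of Kudu.

```python
import re

# Assumed glog prefix layout, matching the crash line quoted in this thread:
#   F0219 12:15:11.389806  6285 mvcc.cc:542] message
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[^\s:]+):(?P<line>\d+)\] (?P<message>.*)$"
)


def lines_for_thread(lines, thread_id):
    """Return the log lines whose thread-ID field equals thread_id.

    Hypothetical helper: unlike a plain grep, it ignores lines where the
    number only appears inside the message text. Continuation lines of a
    multi-line message carry no prefix and are therefore skipped.
    """
    wanted = str(thread_id)
    matches = []
    for raw in lines:
        text = raw.rstrip("\n")
        m = GLOG_LINE.match(text)
        if m and m.group("thread") == wanted:
            matches.append(text)
    return matches
```

To use it, pass an open file object for kudu-tserver.INFO and the thread ID from the fatal line (6285 here), then read upward from the crash through the returned lines to find the one that mentions the tablet being bootstrapped.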

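The fatal message itself can be checked by hand. Reading the printed snapshot literally, a timestamp T counts as committed if T is below 5963266013874118554 or is in the listed set. The sketch below applies that rule to the failing timestamp. The function and constant names are hypothetical, and this mirrors only the printed form of the snapshot, not Kudu's actual implementation.

```python
def is_committed(timestamp, all_committed_before, committed_set):
    """Membership rule as printed in the log:
    committed = {T | T < all_committed_before or T in committed_set}
    """
    return timestamp < all_committed_before or timestamp in committed_set


# Values copied from the fatal log line quoted in this thread.
FAILING_TIMESTAMP = 5963266013874102274
ALL_COMMITTED_BEFORE = 5963266013874118554
COMMITTED_SET = {5963266013874118554}

already_committed = is_committed(
    FAILING_TIMESTAMP, ALL_COMMITTED_BEFORE, COMMITTED_SET
)
```

The failing timestamp is lower than the snapshot's bound, so the snapshot already treats it as committed, which is consistent with the "is already committed" error reported during bootstrap.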