accumulo-user mailing list archives

From Tim I <...@timisrael.com>
Subject Re: Persistent outstanding migrations message
Date Sat, 20 Aug 2016 16:16:37 GMT
Hi all,

Sorry, meant to reply back sooner.  I tried to delete the fate operation
and got the following error:

accumulo shell> fate delete 667becf32c0fe544
ERROR: Master lock is held, not running
Could not delete transaction: 667becf32c0fe544

The lock is held according to ZooKeeper, so I went hunting in ZooKeeper for
the lock and found this (note the matching txid):
zkCli> get /accumulo/ACCUMULO_INSTANCE_ID/table_locks/+default/lock-0000000001
READ:667becf32c0fe544
...
ctime = Mon Jul 18 16:53:55 GMT 2016
...
mtime = Mon Jul 18 16:53:55 GMT 2016
...
numChildren = 0


I tried to remove it and realized I needed to auth as accumulo, so I did that
and rmr'd the lock (note the lock path from above):
zkCli> addauth digest accumulo:ACCUMULO_SECRET
zkCli> rmr /accumulo/ACCUMULO_INSTANCE_ID/table_locks/+default/lock-0000000001

** It probably would have been safer to use "delete" instead of "rmr".
Just wanted to pass along a caution to future readers. **

The issue returned the next day.  So, I think we can rule this out as the
cause for the blocked migrations.

It's possible that these are legitimate migrations, although the warning
shows up about 4000 times over the course of a week or so.
I'll dig a little more and post back if I find something interesting.
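
For anyone who wants to quantify how often the warning appears, here is a
minimal sketch that counts it per day in a master log. The message text is
taken from the warning quoted in this thread; the "YYYY-MM-DD HH:MM:SS,mmm"
timestamp prefix and the function name are assumptions, so adjust them to
match your own log layout.

```python
import re
from collections import Counter

# Message text as quoted in this thread; the count is captured but unused here.
WARNING = re.compile(r"Not balancing due to (\d+) outstanding migrations")
# ASSUMPTION: each log line starts with a "YYYY-MM-DD " date prefix.
DATE = re.compile(r"^(\d{4}-\d{2}-\d{2}) ")


def count_warnings_per_day(lines):
    """Return a Counter mapping a date string to the number of warning lines.

    Lines that contain the warning but no recognizable date prefix are
    counted under the key "unknown".
    """
    counts = Counter()
    for line in lines:
        if WARNING.search(line):
            match = DATE.match(line)
            counts[match.group(1) if match else "unknown"] += 1
    return counts
```

To use it, pass an open file handle for the master log, for example
count_warnings_per_day(open(path)), where path is wherever your master log
lives, and print the sorted items. A steady count per day would suggest a
recurring condition rather than a one-off burst.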

Thanks!

Tim

On Fri, Aug 5, 2016 at 7:54 PM, Tim I <tim@timisrael.com> wrote:

> That was a good idea, Josh.  IIRC, it was from your 2015 post that I found
> out about bouncing the Master because of possible old bugs.
>
> I checked the logs and found this:
>
> 18 16:53:55,535 [zookeeper.DistributedReadWriteLock] INFO : Added lock
> entry 1 userData 667becf32c0fe544 lockType READ
> 18 16:53:55,536 [tableOps.Utils] INFO : namespace +default
> (667becf32c0fe544) locked for read operation: COMPACT_CANCEL
> 18 16:53:55,542 [zookeeper.DistributedReadWriteLock] INFO : Added lock
> entry 0 userData 667becf32c0fe544 lockType READ
> 18 16:53:55,543 [tableOps.Utils] INFO : table 19 (667becf32c0fe544) locked
> for read operation: COMPACT_CANCEL
>
> I can't find record of the lock in zookeeper either.
>
> Will try to experiment more on Monday.
>
> I want to see if I can clear the logs, then wait for the migration
> warning, and finally repeat after deleting that outstanding fate operation
> (which does not seem to be tied to anything).
>
> Thanks!
>
> Tim
>
> On Thu, Aug 4, 2016 at 11:17 PM, Josh Elser <josh.elser@gmail.com> wrote:
>
>> FWIW, migrations that never go away have been a symptom of bugs in the
>> Master before. The master gets into a state where it either stops
>> processing migrations or it doesn't realize that there is a migration to
>> process. You might be able to grep over the Master log and find information
>> about migrations. Sorry I don't have anything more specific.
>>
>> The lock without a FATE op also seems problematic, but might be unrelated
>> to the migration? You might be able to find more information in the master
>> log about that FATE transaction ID.
>>
>> Michael Wall wrote:
>>
>>> Are you currently experiencing 1 outstanding migration?  Does it go away
>>> on its own?  Unless servers are going down, tablets will migrate when
>>> their split threshold is reached.  Is it possible you are constantly
>>> splitting a table?
>>>
>>> If all the tservers appear to be in good shape, maybe it is an issue
>>> with the master.  What does the jstack look like for that?
>>>
>>> On Thu, Aug 4, 2016 at 12:06 PM, Tim I <tim@timisrael.com> wrote:
>>>
>>>     Hi Mike,
>>>
>>>     Thanks for the direction.
>>>
>>>     Empty result set from the scan you suggested
>>>
>>>     There was a lock without an associated FATE operation.
>>>
>>>         The following locks did not have an associated FATE operation
>>>         txid: 667becf32c0fe544  locked: [R:+default]
>>>
>>>
>>>     No recoveries stuck currently, and no long running scans.
>>>
>>>     Otherwise, the system seems fine.
>>>
>>>     Is it possible this is just benign?  Should we monitor for locks
>>>     that don't have FATE operations and delete them from time to time?
>>>
>>>     Thanks,
>>>
>>>     Tim
>>>
>>>     On Thu, Aug 4, 2016 at 11:44 AM, Michael Wall <mjwall@gmail.com> wrote:
>>>
>>>         Hi Tim,
>>>
>>>         You can try scanning the metadata table for a future colfam.
>>>         Something like
>>>
>>>         scan -t accumulo.metadata -c fut
>>>
>>>         If you find one, look at the tabletserver that is slated to host
>>>         that tablet.  There could be an issue with that server
>>>         preventing assignment from completing.  Get a jstack and save
>>>         the logs so you can further troubleshoot.  Killing that tserver
>>>         will cause the assignment to go elsewhere, but make sure you get
>>>         as much info as you can before killing it.
>>>
>>>         What else is going on with the system?  Do you have any
>>>         recoveries that are stuck?  Are there any fate transactions that
>>>         have been running for a while?  Any long running scans?
>>>
>>>         HTH
>>>
>>>         Mike
>>>
>>>         On Thu, Aug 4, 2016 at 11:04 AM, Tim I <tim@timisrael.com> wrote:
>>>
>>>             Hi all,
>>>
>>>             We're running accumulo 1.6.5
>>>
>>>             One of the issues we're seeing on a consistent basis is this
>>>             message:
>>>
>>>                 "Not balancing due to 1 outstanding migrations".
>>>
>>>
>>>             Is there a simple way to see the number of outstanding
>>>             migrations?  Based on what we've read and experienced, it
>>>             eventually means we have to bounce the master to get things
>>>             to a better state, however the message comes back within
>>>             about 1 hour.
>>>
>>>             Any thoughts and suggestions would be greatly appreciated.
>>>
>>>             Thanks,
>>>
>>>             Tim
>>>
>>>
>>>
>>>
>>>
>
