zookeeper-bookkeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Flavio Junqueira <...@yahoo-inc.com>
Subject Re: BK servers in a funky state
Date Fri, 06 Apr 2012 13:13:42 GMT
Hi John, I have created a jira to report my findings so far. Here is a link in the case you're
interested:

	https://issues.apache.org/jira/browse/BOOKKEEPER-212

Feel free to add to the discussion if you see anything I'm missing.

-Flavio

On Apr 5, 2012, at 9:57 PM, John Nagro wrote:

> Thanks Flavio.
> 
> On Thu, Apr 5, 2012 at 3:36 PM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:
> Exactly, that's my suspicion so far, but we need to make sure. I'm also seeing if I can
reproduce the problem independently.
> 
> -Flavio
> 
> 
> On Apr 5, 2012, at 9:28 PM, John Nagro wrote:
> 
>> Okay cool, thanks for the help Flavio. So, just to recap so that I understand, you
suspect that we experienced BOOKKEEPER-198, put the machines into a funky state, and now we
just need to start over with these servers - however 198 has been patched and we shouldn't
see this specific failure mode again with a current cut of the code?
>> 
>> Thanks again for all your assistance.
>> 
>> -John
>> 
>> On Thu, Apr 5, 2012 at 3:05 PM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:
>> 
>> On Apr 5, 2012, at 8:41 PM, John Nagro wrote:
>> 
>>> All 5 machines behave the same way. It is not the end of the world if we can't
recover the data from them in this instance, but at some point the data sent through this
WAL will be very important to be able to recover in the event of a crash.
>> 
>> Understood and agreed, losing data is not an option. Unfortunately, it sounds like
you have hit a nasty bug, let's make sure to fix it because I'm sure it could also hit others,
especially if BOOKKEEPER-198 does not fix it. 
>> 
>>> 
>>> I tried to move the logs/current directory out of the way, replacing it with
another and got this:
>>> 
>>> 2012-04-05 14:39:31,829 - INFO  [main-EventThread:Bookie$3@678] - ZK client has
been reconnected to the ZK server!
>>> 2012-04-05 14:39:31,854 - ERROR [main:Bookie@331] - Cookie exists in zookeeper,
but not in all local directories
>>> java.io.FileNotFoundException: /opt/bookkeeper/logs/current/VERSION (No such
file or directory)
>>> 	at java.io.FileInputStream.open(Native Method)
>>> 	at java.io.FileInputStream.<init>(FileInputStream.java:106)
>>> 	at java.util.Scanner.<init>(Scanner.java:636)
>>> 	at org.apache.bookkeeper.bookie.Cookie.readFromDirectory(Cookie.java:188)
>>> 	at org.apache.bookkeeper.bookie.Bookie.checkEnvironment(Bookie.java:326)
>>> 	at org.apache.bookkeeper.bookie.Bookie.<init>(Bookie.java:401)
>>> 	at org.apache.bookkeeper.proto.BookieServer.<init>(BookieServer.java:75)
>>> 	at org.apache.bookkeeper.proto.BookieServer.main(BookieServer.java:294)
>>> 2012-04-05 14:39:31,856 - ERROR [main:BookieServer@308] - Exception running bookie
server : 
>>> org.apache.bookkeeper.bookie.BookieException$InvalidCookieException
>>> 	at org.apache.bookkeeper.bookie.Bookie.checkEnvironment(Bookie.java:332)
>>> 	at org.apache.bookkeeper.bookie.Bookie.<init>(Bookie.java:401)
>>> 	at org.apache.bookkeeper.proto.BookieServer.<init>(BookieServer.java:75)
>>> 	at org.apache.bookkeeper.proto.BookieServer.main(BookieServer.java:294)
>>> 
>>> which makes me believe i'd need to clean out the cookie entry in ZK for this
server?
>> 
>> Yes, you need to clean up and restart zookeeper too.
>> 
>> -Flavio
>> 
>>> 
>>> On Thu, Apr 5, 2012 at 1:54 PM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:
>>> By cleaning it up, I meant deleting the log and index files. You must have passed
paths when you started bookies, I was suggesting to wipe them out. 
>>> 
>>> I just realized that there are a couple of things I don't understand:
>>> 
>>> 1- How many bookies have been affected? Just one or all of them? If it is just
one, then you may need to use the bookie recovery tool to reconstruct the ledger fragments
of the problematic bookie.
>>> 2- Can you clean up all bookies and zookeeper and start it over or there is data
you need to recover? 
>>> 
>>> -Flavio
>>> 
>>> On Apr 5, 2012, at 7:32 PM, John Nagro wrote:
>>> 
>>>> Flavio -
>>>> 
>>>> The log is from an existing server which crashed and I am trying to bring
it back up. How might i clean things up? What about existing ledgers?
>>>> 
>>>> -John
>>>> 
>>>> On Thu, Apr 5, 2012 at 12:57 PM, Flavio Junqueira <fpj@yahoo-inc.com>
wrote:
>>>> Actually, I have a question about the new logs you sent. Have you tried to
clean up the state of bookies before rerunning them or these logs correspond to a bookie that
you're bringing back up and the bookie is trying to read its old state? If it is the latter,
then I would suggest to give it a fresh start. 
>>>> 
>>>> -Flavio
>>>> 
>>>> On Apr 5, 2012, at 4:53 PM, John Nagro wrote:
>>>> 
>>>>> Flavio -
>>>>> 
>>>>> I really appreciate your prompt response. Some quick background - we
use some of the hadoop technologies for storage, coordination, and processing. Recently we
wanted to add a write-ahead-log to our infrastructure so that clients could record "transactions"
prior to executing them - such as updates going to an API or processing of an event. I've
written a set of tools that use BK as a generic write-ahead-logger. Clients (using zookeeper
for coordination) can create named write ahead logs with custom chunking (how frequently a
new ledger is created - based on size/time). Once a ledger has rolled-over (or a client crashes),
a persister (monitoring ZK) reads that ledger and persists it to S3/HDFS as hadoop sequence
files where a map-reduce process can reconcile it. The ledger is then deleted from BK. This
is all done using ZK in a fashion where (hopefully) once a client has written any data to
the ledger it will always end up on S3/HDFS (via BK) even if the client crashes (the persister
will always know which ledger belongs to which log and which ledgers are currently in use).
>>>>> 
>>>>> Does that sound like an appropriate use of BK? It seemed like a natural
fit as a durable storage solution until something can reliably get it to a place where it
would ultimately be archived and could be reprocessed/reconciled (S3/HDFS).
>>>>> 
>>>>> As for the bug fix you mentioned, this gist shows the logs from the cut
i made this morning:
>>>>> 
>>>>> https://gist.github.com/aea874d89b28d4cfef31
>>>>> 
>>>>> As you can see, there are still some exceptions and error messages that
repeat (forever). This is the newest cut available on github, last commit is:
>>>>> 
>>>>> commit f694716e289c448ab89cab5fa81ea0946f9d9193
>>>>> Author: Flavio Paiva Junqueira <fpj@apache.org>
>>>>> Date:   Tue Apr 3 16:02:44 2012 +0000
>>>>> 
>>>>> BOOKKEEPER-207: BenchBookie doesn't run correctly (ivank via fpj)
>>>>> 
>>>>> git-svn-id: https://svn.apache.org/repos/asf/zookeeper/bookkeeper/trunk@1309007
13f79535-47bb-0310-9956-ffa450edef68
>>>>> 
>>>>> 
>>>>> What are your thoughts? Thanks!
>>>>> 
>>>>> -John
>>>>> 
>>>>> 
>>>>> On Thu, Apr 5, 2012 at 10:10 AM, Flavio Junqueira <fpj@yahoo-inc.com>
wrote:
>>>>> Hi John, Let's see if I can help:
>>>>> 
>>>>> On Apr 5, 2012, at 3:19 PM, John Nagro wrote:
>>>>> 
>>>>>> Hello -
>>>>>> 
>>>>>> I've been hitting Ivan up for advice about a bookkeeper project of
mine. I recently ran into another issue and he suggested I inquire here since he is traveling.
>>>>>> 
>>>>>> We've got a pool of 5 BK servers running in EC2. Last night they
got into a funky state and/or crashed - unfortunately the log with the original event got
rotated (that has been fixed). I was running a cut of 4.1.0-SNAPSHOT sha 6d56d60831a63fe9520ce156686d0cb1142e44f5
from Wed Mar 28 21:57:40 2012 +0000 which brought everything up to BOOKKEEPER-195. That build
had some bugfixes over 4.0.0 that I was originally running (and a previous version before
that).
>>>>>> 
>>>>> 
>>>>> Is there anything else you can say about your application, like how fast
you're writing and how often you're rolling ledgers maybe? Are you deleting ledgers at all?
>>>>> 
>>>>> 
>>>>>> When I restart the servers after the incident this is what the logs
looked like:
>>>>>> 
>>>>>> https://gist.github.com/f2b9c8c76943b057546e
>>>>>> 
>>>>>> Which contain a lot of errors - although it appears the servers come
up (i have not tried to use the servers yet). Although I don't have the original stack that
caused the crash, the logs from recently after the crash contained a lot of this stack:
>>>>>> 
>>>>>> 2012-04-04 21:04:58,833 - INFO  [GarbageCollectorThread:GarbageCollectorThread@266]
- Deleting entryLogId 4 as it has no active ledgers!
>>>>>> 2012-04-04 21:04:58,834 - ERROR [GarbageCollectorThread:EntryLogger@188]
- Trying to delete an entryLog file that could not be found: 4.log
>>>>>> 2012-04-04 21:04:59,783 - WARN  [NIOServerFactory-3181:NIOServerFactory@129]
- Exception in server socket loop: /0.0.0.0
>>>>>> 
>>>>>> java.util.NoSuchElementException
>>>>>>         at java.util.LinkedList.getFirst(LinkedList.java:109)
>>>>>>         at org.apache.bookkeeper.bookie.LedgerCacheImpl.grabCleanPage(LedgerCacheImpl.java:458)
>>>>>>         at org.apache.bookkeeper.bookie.LedgerCacheImpl.putEntryOffset(LedgerCacheImpl.java:165)
>>>>>>         at org.apache.bookkeeper.bookie.LedgerDescriptorImpl.addEntry(LedgerDescriptorImpl.java:93)
>>>>>>         at org.apache.bookkeeper.bookie.Bookie.addEntryInternal(Bookie.java:999)
>>>>>>         at org.apache.bookkeeper.bookie.Bookie.addEntry(Bookie.java:1034)
>>>>>>         at org.apache.bookkeeper.proto.BookieServer.processPacket(BookieServer.java:359)
>>>>>>         at org.apache.bookkeeper.proto.NIOServerFactory$Cnxn.readRequest(NIOServerFactory.java:315)
>>>>>>         at org.apache.bookkeeper.proto.NIOServerFactory$Cnxn.doIO(NIOServerFactory.java:213)
>>>>>>         at org.apache.bookkeeper.proto.NIOServerFactory.run(NIOServerFactory.java:124)
>>>>> 
>>>>> This looks like what we found and resolved here:
>>>>> 
>>>>> 	https://issues.apache.org/jira/browse/BOOKKEEPER-198
>>>>> 
>>>>>> 
>>>>>> This morning I upgraded to the most recent cut - sha f694716e289c448ab89cab5fa81ea0946f9d9193
made on Tue Apr 3 16:02:44 2012 +0000 and restarted. That did not seem to correct matters,
although the log has slightly different error messages:
>>>>>> 
>>>>>> https://gist.github.com/aea874d89b28d4cfef31
>>>>>> 
>>>>>> Does anyone know whats going on? How i can correct these errors?
Are the machines in an okay state to use?
>>>>> 
>>>>> It sounds like we have resolved it in 198, so if you're using a recent
cut, you shouldn't observe this problem anymore. But, if it does happen again, it would be
great to try to find a way to reproduce it so that we can track the bug... assuming it is
a bug.
>>>>> 
>>>>> -Flavio
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> flavio
>>>> junqueira
>>>> senior research scientist
>>>>  
>>>> fpj@yahoo-inc.com
>>>> direct +34 93-183-8828
>>>>  
>>>> avinguda diagonal 177, 8th floor, barcelona, 08018, es
>>>> phone (408) 349 3300    fax (408) 349 3301
>>>> 
>>>> 
>>> 
>>> flavio
>>> junqueira
>>> senior research scientist
>>>  
>>> fpj@yahoo-inc.com
>>> direct +34 93-183-8828
>>>  
>>> avinguda diagonal 177, 8th floor, barcelona, 08018, es
>>> phone (408) 349 3300    fax (408) 349 3301
>>> 
>>> 
>> 
>> flavio
>> junqueira
>> senior research scientist
>>  
>> fpj@yahoo-inc.com
>> direct +34 93-183-8828
>>  
>> avinguda diagonal 177, 8th floor, barcelona, 08018, es
>> phone (408) 349 3300    fax (408) 349 3301
>> 
>> 
> 
> flavio
> junqueira
> senior research scientist
>  
> fpj@yahoo-inc.com
> direct +34 93-183-8828
>  
> avinguda diagonal 177, 8th floor, barcelona, 08018, es
> phone (408) 349 3300    fax (408) 349 3301
> 
> 

flavio
junqueira
senior research scientist
 
fpj@yahoo-inc.com
direct +34 93-183-8828
 
avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300    fax (408) 349 3301


Mime
View raw message