zookeeper-bookkeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Flavio Junqueira <...@yahoo-inc.com>
Subject Re: BK servers in a funky state
Date Thu, 05 Apr 2012 16:57:25 GMT
Actually, I have a question about the new logs you sent. Have you tried to clean up the state
of bookies before rerunning them or these logs correspond to a bookie that you're bringing
back up and the bookie is trying to read its old state? If it is the latter, then I would
suggest to give it a fresh start. 

-Flavio

On Apr 5, 2012, at 4:53 PM, John Nagro wrote:

> Flavio -
> 
> I really appreciate your prompt response. Some quick background - we use some of the
hadoop technologies for storage, coordination, and processing. Recently we wanted to add a
write-ahead-log to our infrastructure so that clients could record "transactions" prior to
executing them - such as updates going to an API or processing of an event. I've written a
set of tools that use BK as a generic write-ahead-logger. Clients (using zookeeper for coordination)
can create named write ahead logs with custom chunking (how frequently a new ledger is created
- based on size/time). Once a ledger has rolled-over (or a client crashes), a persister (monitoring
ZK) reads that ledger and persists it to S3/HDFS as hadoop sequence files where a map-reduce
process can reconcile it. The ledger is then deleted from BK. This is all done using ZK in
a fashion where (hopefully) once a client has written any data to the ledger it will always
end up on S3/HDFS (via BK) even if the client crashes (the persister will always know which
ledger belongs to which log and which ledgers are currently in use).
> 
> Does that sound like an appropriate use of BK? It seemed like a natural fit as a durable
storage solution until something can reliably get it to a place where it would ultimately
be archived and could be reprocessed/reconciled (S3/HDFS).
> 
> As for the bug fix you mentioned, this gist shows the logs from the cut i made this morning:
> 
> https://gist.github.com/aea874d89b28d4cfef31
> 
> As you can see, there are still some exceptions and error messages that repeat (forever).
This is the newest cut available on github, last commit is:
> 
> commit f694716e289c448ab89cab5fa81ea0946f9d9193
> Author: Flavio Paiva Junqueira <fpj@apache.org>
> Date:   Tue Apr 3 16:02:44 2012 +0000
> 
> BOOKKEEPER-207: BenchBookie doesn't run correctly (ivank via fpj)
> 
> git-svn-id: https://svn.apache.org/repos/asf/zookeeper/bookkeeper/trunk@1309007 13f79535-47bb-0310-9956-ffa450edef68
> 
> 
> What are your thoughts? Thanks!
> 
> -John
> 
> 
> On Thu, Apr 5, 2012 at 10:10 AM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:
> Hi John, Let's see if I can help:
> 
> On Apr 5, 2012, at 3:19 PM, John Nagro wrote:
> 
>> Hello -
>> 
>> I've been hitting Ivan up for advice about a bookkeeper project of mine. I recently
ran into another issue and he suggested I inquire here since he is traveling.
>> 
>> We've got a pool of 5 BK servers running in EC2. Last night they got into a funky
state and/or crashed - unfortunately the log with the original event got rotated (that has
been fixed). I was running a cut of 4.1.0-SNAPSHOT sha 6d56d60831a63fe9520ce156686d0cb1142e44f5
from Wed Mar 28 21:57:40 2012 +0000 which brought everything up to BOOKKEEPER-195. That build
had some bugfixes over 4.0.0 that I was originally running (and a previous version before
that).
>> 
> 
> Is there anything else you can say about your application, like how fast you're writing
and how often you're rolling ledgers maybe? Are you deleting ledgers at all?
> 
> 
>> When I restart the servers after the incident this is what the logs looked like:
>> 
>> https://gist.github.com/f2b9c8c76943b057546e
>> 
>> Which contain a lot of errors - although it appears the servers come up (i have not
tried to use the servers yet). Although I don't have the original stack that caused the crash,
the logs from recently after the crash contained a lot of this stack:
>> 
>> 2012-04-04 21:04:58,833 - INFO  [GarbageCollectorThread:GarbageCollectorThread@266]
- Deleting entryLogId 4 as it has no active ledgers!
>> 2012-04-04 21:04:58,834 - ERROR [GarbageCollectorThread:EntryLogger@188] - Trying
to delete an entryLog file that could not be found: 4.log
>> 2012-04-04 21:04:59,783 - WARN  [NIOServerFactory-3181:NIOServerFactory@129] - Exception
in server socket loop: /0.0.0.0
>> 
>> java.util.NoSuchElementException
>>         at java.util.LinkedList.getFirst(LinkedList.java:109)
>>         at org.apache.bookkeeper.bookie.LedgerCacheImpl.grabCleanPage(LedgerCacheImpl.java:458)
>>         at org.apache.bookkeeper.bookie.LedgerCacheImpl.putEntryOffset(LedgerCacheImpl.java:165)
>>         at org.apache.bookkeeper.bookie.LedgerDescriptorImpl.addEntry(LedgerDescriptorImpl.java:93)
>>         at org.apache.bookkeeper.bookie.Bookie.addEntryInternal(Bookie.java:999)
>>         at org.apache.bookkeeper.bookie.Bookie.addEntry(Bookie.java:1034)
>>         at org.apache.bookkeeper.proto.BookieServer.processPacket(BookieServer.java:359)
>>         at org.apache.bookkeeper.proto.NIOServerFactory$Cnxn.readRequest(NIOServerFactory.java:315)
>>         at org.apache.bookkeeper.proto.NIOServerFactory$Cnxn.doIO(NIOServerFactory.java:213)
>>         at org.apache.bookkeeper.proto.NIOServerFactory.run(NIOServerFactory.java:124)
> 
> This looks like what we found and resolved here:
> 
> 	https://issues.apache.org/jira/browse/BOOKKEEPER-198
> 
>> 
>> This morning I upgraded to the most recent cut - sha f694716e289c448ab89cab5fa81ea0946f9d9193
made on Tue Apr 3 16:02:44 2012 +0000 and restarted. That did not seem to correct matters,
although the log has slightly different error messages:
>> 
>> https://gist.github.com/aea874d89b28d4cfef31
>> 
>> Does anyone know whats going on? How i can correct these errors? Are the machines
in an okay state to use?
> 
> It sounds like we have resolved it in 198, so if you're using a recent cut, you shouldn't
observe this problem anymore. But, if it does happen again, it would be great to try to find
a way to reproduce it so that we can track the bug... assuming it is a bug.
> 
> -Flavio
> 
> 
> 

flavio
junqueira
senior research scientist
 
fpj@yahoo-inc.com
direct +34 93-183-8828
 
avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300    fax (408) 349 3301


Mime
View raw message