zookeeper-bookkeeper-user mailing list archives

From Flavio Junqueira <...@yahoo-inc.com>
Subject Re: BK servers in a funky state
Date Thu, 05 Apr 2012 16:21:21 GMT
Hi John, Thanks for all the information. It does sound like an appropriate use of BK to me, and
the tools you developed seem nice too. In fact, we were thinking of adding named logs on top,
so it's cool that you've also thought of it.

About the bug, I can't really tell what's going on right now. I'll need to dig into the logs
and perhaps try to reproduce it myself to see if I can spot the problem. Let me get back to
you once I have news. In the meantime, if you have any new info, please make sure to report
it to the list.

-Flavio

On Apr 5, 2012, at 4:57 PM, John Nagro wrote:

> Flavio -
> 
> I forgot to mention some scale. At the moment, let's say we create a couple dozen ledgers
> a minute, and we persist them at about the same pace. If something goes wrong (which has
> happened a few times - new software) it is not uncommon to see tens of thousands of ledgers
> in BK.
> 
> Thanks.
> 
> -John
> 
> On Thu, Apr 5, 2012 at 10:53 AM, John Nagro <jnagro@hubspot.com> wrote:
> Flavio -
> 
> I really appreciate your prompt response. Some quick background - we use some of the
> hadoop technologies for storage, coordination, and processing. Recently we wanted to add a
> write-ahead-log to our infrastructure so that clients could record "transactions" prior to
> executing them - such as updates going to an API or processing of an event. I've written a
> set of tools that use BK as a generic write-ahead-logger. Clients (using zookeeper for
> coordination) can create named write ahead logs with custom chunking (how frequently a new
> ledger is created - based on size/time). Once a ledger has rolled-over (or a client crashes),
> a persister (monitoring ZK) reads that ledger and persists it to S3/HDFS as hadoop sequence
> files where a map-reduce process can reconcile it. The ledger is then deleted from BK. This
> is all done using ZK in a fashion where (hopefully) once a client has written any data to
> the ledger it will always end up on S3/HDFS (via BK) even if the client crashes (the
> persister will always know which ledger belongs to which log and which ledgers are currently
> in use).
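
The size/time-based chunking described above could be sketched roughly as follows. This is only an illustration, not John's actual tooling: the class `LedgerRollPolicy` and its methods are hypothetical names, and the sketch deliberately leaves out all BookKeeper and ZooKeeper calls. The assumption is that a writer would close its current ledger and create a new one whenever `shouldRoll` returns true.

```java
/**
 * Hypothetical roll-over policy for a named write-ahead log.
 * It only tracks bytes written and ledger age; it does not talk to
 * BookKeeper or ZooKeeper.
 */
final class LedgerRollPolicy {
    private final long maxBytes;      // roll once this many bytes were written
    private final long maxAgeMillis;  // roll once the ledger is this old
    private long bytesWritten;
    private long openedAtMillis;

    LedgerRollPolicy(long maxBytes, long maxAgeMillis, long nowMillis) {
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
        this.openedAtMillis = nowMillis;
    }

    /** Record an entry of the given size written to the current ledger. */
    void recordWrite(int entryBytes) {
        bytesWritten += entryBytes;
    }

    /** True when either the size limit or the age limit has been reached. */
    boolean shouldRoll(long nowMillis) {
        return bytesWritten >= maxBytes
                || nowMillis - openedAtMillis >= maxAgeMillis;
    }

    /** Call after a new ledger has been opened. */
    void reset(long nowMillis) {
        bytesWritten = 0;
        openedAtMillis = nowMillis;
    }

    public static void main(String[] args) {
        // Illustrative limits: 1 KiB or 60 seconds, whichever comes first.
        LedgerRollPolicy policy = new LedgerRollPolicy(1024, 60_000, 0);
        policy.recordWrite(600);
        System.out.println(policy.shouldRoll(1_000));  // false
        policy.recordWrite(600);
        System.out.println(policy.shouldRoll(2_000));  // true (size limit)
        policy.reset(2_000);
        System.out.println(policy.shouldRoll(62_000)); // true (age limit)
    }
}
```

Passing the clock in as a parameter keeps the sketch deterministic; a real writer would presumably use `System.currentTimeMillis()` and record the ledger-to-log mapping in ZK before rolling.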
> 
> Does that sound like an appropriate use of BK? It seemed like a natural fit as a durable
> storage solution until something can reliably get it to a place where it would ultimately
> be archived and could be reprocessed/reconciled (S3/HDFS).
> 
> As for the bug fix you mentioned, this gist shows the logs from the cut I made this morning:
> 
> https://gist.github.com/aea874d89b28d4cfef31
> 
> As you can see, there are still some exceptions and error messages that repeat (forever).
> This is the newest cut available on GitHub; the last commit is:
> 
> commit f694716e289c448ab89cab5fa81ea0946f9d9193
> Author: Flavio Paiva Junqueira <fpj@apache.org>
> Date:   Tue Apr 3 16:02:44 2012 +0000
> 
> BOOKKEEPER-207: BenchBookie doesn't run correctly (ivank via fpj)
> 
> git-svn-id: https://svn.apache.org/repos/asf/zookeeper/bookkeeper/trunk@1309007 13f79535-47bb-0310-9956-ffa450edef68
> 
> 
> What are your thoughts? Thanks!
> 
> -John
> 
> 
> On Thu, Apr 5, 2012 at 10:10 AM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:
> Hi John, Let's see if I can help:
> 
> On Apr 5, 2012, at 3:19 PM, John Nagro wrote:
> 
>> Hello -
>> 
>> I've been hitting Ivan up for advice about a bookkeeper project of mine. I recently
>> ran into another issue and he suggested I inquire here since he is traveling.
>> 
>> We've got a pool of 5 BK servers running in EC2. Last night they got into a funky
>> state and/or crashed - unfortunately the log with the original event got rotated (that has
>> been fixed). I was running a cut of 4.1.0-SNAPSHOT sha 6d56d60831a63fe9520ce156686d0cb1142e44f5
>> from Wed Mar 28 21:57:40 2012 +0000 which brought everything up to BOOKKEEPER-195. That build
>> had some bugfixes over 4.0.0 that I was originally running (and a previous version before
>> that).
>> 
> 
> Is there anything else you can say about your application, like how fast you're writing
> and how often you're rolling ledgers maybe? Are you deleting ledgers at all?
> 
> 
>> When I restart the servers after the incident this is what the logs looked like:
>> 
>> https://gist.github.com/f2b9c8c76943b057546e
>> 
>> These contain a lot of errors, although it appears the servers come up (I have not
>> tried to use the servers yet). Although I don't have the original stack that caused the
>> crash, the logs from shortly after the crash contained a lot of this stack:
>> 
>> 2012-04-04 21:04:58,833 - INFO  [GarbageCollectorThread:GarbageCollectorThread@266] - Deleting entryLogId 4 as it has no active ledgers!
>> 2012-04-04 21:04:58,834 - ERROR [GarbageCollectorThread:EntryLogger@188] - Trying to delete an entryLog file that could not be found: 4.log
>> 2012-04-04 21:04:59,783 - WARN  [NIOServerFactory-3181:NIOServerFactory@129] - Exception in server socket loop: /0.0.0.0
>> 
>> java.util.NoSuchElementException
>>         at java.util.LinkedList.getFirst(LinkedList.java:109)
>>         at org.apache.bookkeeper.bookie.LedgerCacheImpl.grabCleanPage(LedgerCacheImpl.java:458)
>>         at org.apache.bookkeeper.bookie.LedgerCacheImpl.putEntryOffset(LedgerCacheImpl.java:165)
>>         at org.apache.bookkeeper.bookie.LedgerDescriptorImpl.addEntry(LedgerDescriptorImpl.java:93)
>>         at org.apache.bookkeeper.bookie.Bookie.addEntryInternal(Bookie.java:999)
>>         at org.apache.bookkeeper.bookie.Bookie.addEntry(Bookie.java:1034)
>>         at org.apache.bookkeeper.proto.BookieServer.processPacket(BookieServer.java:359)
>>         at org.apache.bookkeeper.proto.NIOServerFactory$Cnxn.readRequest(NIOServerFactory.java:315)
>>         at org.apache.bookkeeper.proto.NIOServerFactory$Cnxn.doIO(NIOServerFactory.java:213)
>>         at org.apache.bookkeeper.proto.NIOServerFactory.run(NIOServerFactory.java:124)
> 
> This looks like what we found and resolved here:
> 
> 	https://issues.apache.org/jira/browse/BOOKKEEPER-198
> 
>> 
>> This morning I upgraded to the most recent cut - sha f694716e289c448ab89cab5fa81ea0946f9d9193
>> made on Tue Apr 3 16:02:44 2012 +0000 and restarted. That did not seem to correct matters,
>> although the log has slightly different error messages:
>> 
>> https://gist.github.com/aea874d89b28d4cfef31
>> 
>> Does anyone know what's going on? How can I correct these errors? Are the machines
>> in an okay state to use?
> 
> It sounds like we have resolved it in 198, so if you're using a recent cut, you shouldn't
> observe this problem anymore. But, if it does happen again, it would be great to try to find
> a way to reproduce it so that we can track the bug... assuming it is a bug.
> 
> -Flavio
> 
> 
> 
> 

flavio
junqueira
senior research scientist
 
fpj@yahoo-inc.com
direct +34 93-183-8828
 
avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300    fax (408) 349 3301

