flume-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From aaron morton <aa...@thelastpickle.com>
Subject race conditions and WAL shutdown
Date Sun, 14 Aug 2011 10:18:07 GMT
I'm starting to play with Flume and running into some issues. I've found a couple of tickets
that talk about similar things, but  wanted to check  if it was a problem on my side…

Using 0.9.4 bin distro on OSX lion, all local, Java 1.6.0_26

start a master and bin/flume node_nowatch -n agen and bin/flume node_nowatch -n coll

Then add the following config via the web portal for the master (am using batch / unbatch
as a test for aggregating data later)…

agen : text("/my/file") | agentSink("localhost",35853) ; 
coll : collectorSource(35853) | batch(100) unbatch console ; 

On the agen node it goes bang and shuts it's self down…

21:40:53,064  INFO NaiveFileWALManager:179 - NaiveFileWALManager is now open
21:40:53,066  INFO InsistentOpenDecorator:128 - open attempt 0 failed, backoff (1000ms): Failed
to open thrift event sink to localhost:35853 : java.net.ConnectException: Connection refused
21:40:54,063 ERROR DirectDriver:311 - Expected ACTIVE but timed out in state OPENING
21:40:54,069  INFO ThriftEventSink:112 - ThriftEventSink to localhost:35853 opened
21:40:54,069  INFO InsistentOpenDecorator:116 - Opened ThriftEventSink on try 1
21:40:54,069  INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting down
21:40:54,069  INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
21:40:54,218  INFO TextFileSource:70 - File /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small
closed
21:40:54,218  INFO RollSink:206 - closing RollSink 'ackingWal'
21:40:54,219  INFO SeqfileEventSink:92 - closed /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
21:40:54,219  INFO AckListener$Empty:45 - Empty Ack Listener ended 20110814-214053048+1200.1313314853048500000.00000023
21:40:54,220  INFO NaiveFileWALManager:474 - File lives in /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
21:40:54,220  WARN NaiveFileWALManager:205 - Already shutting down, but getting another shutting
down notice, odd
21:40:54,221  INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting down

I waited for a while, no retry. So killed both processes and started the coll then the agen
process. The lines from the source file were delivered and this is what I see in the agen
process…

21:52:43,397  INFO NaiveFileWALManager:760 - opening log file 20110814-214053048+1200.1313314853048500000.00000023
21:52:43,605  INFO TextFileSource:70 - File /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small
closed
21:52:43,605  INFO RollSink:206 - closing RollSink 'ackingWal'
21:52:43,608  INFO SeqfileEventSink:92 - closed /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
21:52:43,609  INFO AckListener$Empty:45 - Empty Ack Listener ended 20110814-215243380+1200.1313315563380444000.00000023
21:52:43,609  INFO NaiveFileWALManager:474 - File lives in /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
21:52:43,610  INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting down
21:52:43,952  INFO WALAckManager:92 - Ack for 20110814-214053048+1200.1313314853048500000.00000023
is queued to be checked
21:52:43,952  INFO WALSource:97 - end of file NaiveFileWALManager (dir=/tmp/flume-aaron/agent/agen
)
21:52:43,952  INFO NaiveFileWALManager:760 - opening log file 20110814-215243380+1200.1313315563380444000.00000023
21:52:44,110 ERROR DirectDriver:311 - Expected IDLE but timed out in state ACTIVE
21:52:44,111  INFO NaiveFileWALDeco:185 - Closing wal log, subsink still making progress
21:52:44,316  INFO WALAckManager:92 - Ack for 20110814-215243380+1200.1313315563380444000.00000023
is queued to be checked
21:52:44,316  INFO WALSource:97 - end of file NaiveFileWALManager (dir=/tmp/flume-aaron/agent/agen
)
21:52:44,517  WARN NaiveFileWALManager:205 - Already shutting down, but getting another shutting
down notice, odd
21:52:44,518  INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting down
21:52:44,518  INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
21:53:48,409  INFO WALAckManager:153 - Retransmitting 20110814-215243380+1200.1313315563380444000.00000023
after being stale for 64093ms
21:53:48,409  INFO WALAckManager:153 - Retransmitting 20110814-214053048+1200.1313314853048500000.00000023
after being stale for 64457ms
21:54:48,444  INFO WALAckManager:153 - Retransmitting 20110814-215243380+1200.1313315563380444000.00000023
after being stale for 60035ms
21:54:48,444  WARN NaiveFileWALManager:789 - There was a race that happend with SENT vs SENDING
states
21:54:48,444  INFO WALAckManager:153 - Retransmitting 20110814-214053048+1200.1313314853048500000.00000023
after being stale for 60035ms
21:54:48,444  WARN NaiveFileWALManager:789 - There was a race that happend with SENT vs SENDING
states
21:55:48,478  INFO WALAckManager:153 - Retransmitting 20110814-215243380+1200.1313315563380444000.00000023
after being stale for 60034ms
21:55:48,479  WARN NaiveFileWALManager:789 - There was a race that happend with SENT vs SENDING
states

I noticed nothing in the logs for coll. 


My questions are:

1) Am I going about this test the right way ?

2) For the first case, should the agen process retry ? It sounds a little like this https://issues.apache.org/jira/browse/FLUME-711


3) Are the "NaiveFileWALManager shutting down" messages as bad as the sound ? I'm guessing
not because they are INFO.  

4) Is there a timeout to set to avoid "ERROR DirectDriver:311 - Expected ACTIVE but timed
out in state OPENING" ? 

5) "INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed"  I checked lsof with the
processes still running and saw
202-126-206-214:tmp aaron$ lsof -i -P | grep 35853
java      44571 aaron  113u  IPv6 0xffffff8017bec600      0t0  TCP *:35853 (LISTEN)

6) "WARN NaiveFileWALManager:789 - There was a race that happend with SENT vs SENDING states"
Is this a known issue ? Am I seeing this ? https://issues.apache.org/jira/browse/FLUME-637


7) "INFO WALAckManager:153 -  Retransmitting 20110814-214053048+1200.1313314853048500000.00000023
after being stale for 60035ms" Went on as along as I left the agen process running, there
was nothing been displayed on the coll though. Is this also part of FLUME-637 ? 

Thanks for any help. 
 
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com


Mime
View raw message