incubator-flume-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mingjie Lai <mjla...@gmail.com>
Subject Re: race conditions and WAL shutdown
Date Mon, 15 Aug 2011 18:17:52 GMT

Saw the problem before. The race condition warning really confused me at 
the very beginning, but it turned out to be a configure issue. Please 
take a look at these threads:

https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/c35f07463c5c2721/0f63f323f066a827?lnk=gst&q=agente2esink+#0f63f323f066a827

https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/64e786384928a505/f023c918081e8e48?lnk=gst&q=agente2esink+#f023c918081e8e48

In short, you need to use collector wrapper for agentE2ESink.

Thanks,
Mingjie


On 08/14/2011 03:18 AM, aaron morton wrote:
> I'm starting to play with Flume and running into some issues. I've found
> a couple of tickets that talk about similar things, but wanted to check
> if it was a problem on my side…
>
> Using 0.9.4 bin distro on OSX lion, all local, Java 1.6.0_26
>
> start a master and bin/flume node_nowatch -n agen and bin/flume
> node_nowatch -n coll
>
> Then add the following config via the web portal for the master (am
> using batch / unbatch as a test for aggregating data later)…
>
> agen : text("/my/file") | agentSink("localhost",35853) ;
> coll : collectorSource(35853) | batch(100) unbatch console ;
>
> On the agen node it goes bang and shuts it's self down…
>
> 21:40:53,064 INFO NaiveFileWALManager:179 - NaiveFileWALManager is now open
> 21:40:53,066 INFO InsistentOpenDecorator:128 - open attempt 0 failed,
> backoff (1000ms): Failed to open thrift event sink to localhost:35853 :
> java.net.ConnectException: Connection refused
> 21:40:54,063 ERROR DirectDriver:311 - Expected ACTIVE but timed out in
> state OPENING
> 21:40:54,069 INFO ThriftEventSink:112 - ThriftEventSink to
> localhost:35853 opened
> 21:40:54,069 INFO InsistentOpenDecorator:116 - Opened ThriftEventSink on
> try 1
> 21:40:54,069 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
> down
> 21:40:54,069 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
> 21:40:54,218 INFO TextFileSource:70 - File
> /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small
> closed
> 21:40:54,218 INFO RollSink:206 - closing RollSink 'ackingWal'
> 21:40:54,219 INFO SeqfileEventSink:92 - closed
> /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
> 21:40:54,219 INFO AckListener$Empty:45 - Empty Ack Listener ended
> 20110814-214053048+1200.1313314853048500000.00000023
> 21:40:54,220 INFO NaiveFileWALManager:474 - File lives in
> /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
> 21:40:54,220 WARN NaiveFileWALManager:205 - Already shutting down, but
> getting another shutting down notice, odd
> 21:40:54,221 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
> down
>
> I waited for a while, no retry. So killed both processes and started the
> coll then the agen process. The lines from the source file were
> delivered and this is what I see in the agen process…
>
> 21:52:43,397 INFO NaiveFileWALManager:760 - opening log file
> 20110814-214053048+1200.1313314853048500000.00000023
> 21:52:43,605 INFO TextFileSource:70 - File
> /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small
> closed
> 21:52:43,605 INFO RollSink:206 - closing RollSink 'ackingWal'
> 21:52:43,608 INFO SeqfileEventSink:92 - closed
> /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
> 21:52:43,609 INFO AckListener$Empty:45 - Empty Ack Listener ended
> 20110814-215243380+1200.1313315563380444000.00000023
> 21:52:43,609 INFO NaiveFileWALManager:474 - File lives in
> /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
> 21:52:43,610 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
> down
> 21:52:43,952 INFO WALAckManager:92 - Ack for
> 20110814-214053048+1200.1313314853048500000.00000023 is queued to be checked
> 21:52:43,952 INFO WALSource:97 - end of file NaiveFileWALManager
> (dir=/tmp/flume-aaron/agent/agen )
> 21:52:43,952 INFO NaiveFileWALManager:760 - opening log file
> 20110814-215243380+1200.1313315563380444000.00000023
> 21:52:44,110 ERROR DirectDriver:311 - Expected IDLE but timed out in
> state ACTIVE
> 21:52:44,111 INFO NaiveFileWALDeco:185 - Closing wal log, subsink still
> making progress
> 21:52:44,316 INFO WALAckManager:92 - Ack for
> 20110814-215243380+1200.1313315563380444000.00000023 is queued to be checked
> 21:52:44,316 INFO WALSource:97 - end of file NaiveFileWALManager
> (dir=/tmp/flume-aaron/agent/agen )
> 21:52:44,517 WARN NaiveFileWALManager:205 - Already shutting down, but
> getting another shutting down notice, odd
> 21:52:44,518 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
> down
> 21:52:44,518 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
> 21:53:48,409 INFO WALAckManager:153 - Retransmitting
> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
> for 64093ms
> 21:53:48,409 INFO WALAckManager:153 - Retransmitting
> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
> for 64457ms
> 21:54:48,444 INFO WALAckManager:153 - Retransmitting
> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
> for 60035ms
> 21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that
> happend with SENT vs SENDING states
> 21:54:48,444 INFO WALAckManager:153 - Retransmitting
> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
> for 60035ms
> 21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that
> happend with SENT vs SENDING states
> 21:55:48,478 INFO WALAckManager:153 - Retransmitting
> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
> for 60034ms
> 21:55:48,479 WARN NaiveFileWALManager:789 - There was a race that
> happend with SENT vs SENDING states
>
> I noticed nothing in the logs for coll.
>
>
> My questions are:
>
> 1) Am I going about this test the right way ?
>
> 2) For the first case, should the agen process retry ? It sounds a
> little like this https://issues.apache.org/jira/browse/FLUME-711
>
> 3) Are the "NaiveFileWALManager shutting down" messages as bad as the
> sound ? I'm guessing not because they are INFO.
>
> 4) Is there a timeout to set to avoid "ERROR DirectDriver:311 - Expected
> ACTIVE but timed out in state OPENING" ?
>
> 5) "INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed" I
> checked lsof with the processes still running and saw
> 202-126-206-214:tmp aaron$ lsof -i -P | grep 35853
> java 44571 aaron 113u IPv6 0xffffff8017bec600 0t0 TCP *:35853 (LISTEN)
>
> 6) "WARN NaiveFileWALManager:789 - There was a race that happend with
> SENT vs SENDING states" Is this a known issue ? Am I seeing this ?
> https://issues.apache.org/jira/browse/FLUME-637
>
> 7) "INFO WALAckManager:153 - Retransmitting
> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
> for 60035ms" Went on as along as I left the agen process running, there
> was nothing been displayed on the coll though. Is this also part of
> FLUME-637 ?
>
> Thanks for any help.
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>

Mime
View raw message