Return-Path: X-Original-To: apmail-incubator-flume-user-archive@minotaur.apache.org Delivered-To: apmail-incubator-flume-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 9F2757352 for ; Mon, 15 Aug 2011 18:18:26 +0000 (UTC) Received: (qmail 38463 invoked by uid 500); 15 Aug 2011 18:18:26 -0000 Delivered-To: apmail-incubator-flume-user-archive@incubator.apache.org Received: (qmail 38364 invoked by uid 500); 15 Aug 2011 18:18:25 -0000 Mailing-List: contact flume-user-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: flume-user@incubator.apache.org Delivered-To: mailing list flume-user@incubator.apache.org Received: (qmail 38354 invoked by uid 99); 15 Aug 2011 18:18:25 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 15 Aug 2011 18:18:25 +0000 X-ASF-Spam-Status: No, hits=-0.6 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of mjlai09@gmail.com designates 209.85.213.175 as permitted sender) Received: from [209.85.213.175] (HELO mail-yx0-f175.google.com) (209.85.213.175) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 15 Aug 2011 18:18:17 +0000 Received: by yxi19 with SMTP id 19so3238352yxi.6 for ; Mon, 15 Aug 2011 11:17:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=message-id:date:from:user-agent:mime-version:to:subject:references :in-reply-to:content-type:content-transfer-encoding; bh=KbWyYTn9iJNDik+LEACTpzz12fbHvU1EcqNTd4YK2fs=; b=wSAJ7LdFXtdfPWMhf53JhALcZa7LDS8DJfof7uH7UnAdL9/3vtPpwPGgs5iEBu8EI0 EH56FUz3xIMPVKCzjuFAf4PW94ztdHkx3UCi/Rr5Gub7tPIMRRmRSWT17oSkZyNN4+bW SIIWfHPOrP8krS23Zg4KZCbtDBrIs31HcrQGo= Received: by 10.101.217.14 with SMTP id u14mr4132754anq.10.1313432275475; Mon, 15 Aug 2011 11:17:55 -0700 (PDT) Received: from [10.52.44.34] ([216.207.36.222]) by mx.google.com with ESMTPS id 9sm5341015and.25.2011.08.15.11.17.53 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 15 Aug 2011 11:17:54 -0700 (PDT) Message-ID: <4E4962D0.80407@googlemail.com> Date: Mon, 15 Aug 2011 11:17:52 -0700 From: Mingjie Lai User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110424 Thunderbird/3.1.10 MIME-Version: 1.0 To: flume-user@incubator.apache.org Subject: Re: race conditions and WAL shutdown References: <6AED1CAF-1552-4D83-8BC5-75E23376D229@thelastpickle.com> In-Reply-To: <6AED1CAF-1552-4D83-8BC5-75E23376D229@thelastpickle.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 8bit X-Virus-Checked: Checked by ClamAV on apache.org Saw the problem before. The race condition warning really confused me at the very beginning, but it turned out to be a configure issue. Please take a look at these threads: https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/c35f07463c5c2721/0f63f323f066a827?lnk=gst&q=agente2esink+#0f63f323f066a827 https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/64e786384928a505/f023c918081e8e48?lnk=gst&q=agente2esink+#f023c918081e8e48 In short, you need to use collector wrapper for agentE2ESink. Thanks, Mingjie On 08/14/2011 03:18 AM, aaron morton wrote: > I'm starting to play with Flume and running into some issues. I've found > a couple of tickets that talk about similar things, but wanted to check > if it was a problem on my side� > > Using 0.9.4 bin distro on OSX lion, all local, Java 1.6.0_26 > > start a master and bin/flume node_nowatch -n agen and bin/flume > node_nowatch -n coll > > Then add the following config via the web portal for the master (am > using batch / unbatch as a test for aggregating data later)� > > agen : text("/my/file") | agentSink("localhost",35853) ; > coll : collectorSource(35853) | batch(100) unbatch console ; > > On the agen node it goes bang and shuts it's self down� > > 21:40:53,064 INFO NaiveFileWALManager:179 - NaiveFileWALManager is now open > 21:40:53,066 INFO InsistentOpenDecorator:128 - open attempt 0 failed, > backoff (1000ms): Failed to open thrift event sink to localhost:35853 : > java.net.ConnectException: Connection refused > 21:40:54,063 ERROR DirectDriver:311 - Expected ACTIVE but timed out in > state OPENING > 21:40:54,069 INFO ThriftEventSink:112 - ThriftEventSink to > localhost:35853 opened > 21:40:54,069 INFO InsistentOpenDecorator:116 - Opened ThriftEventSink on > try 1 > 21:40:54,069 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting > down > 21:40:54,069 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed > 21:40:54,218 INFO TextFileSource:70 - File > /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small > closed > 21:40:54,218 INFO RollSink:206 - closing RollSink 'ackingWal' > 21:40:54,219 INFO SeqfileEventSink:92 - closed > /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023 > 21:40:54,219 INFO AckListener$Empty:45 - Empty Ack Listener ended > 20110814-214053048+1200.1313314853048500000.00000023 > 21:40:54,220 INFO NaiveFileWALManager:474 - File lives in > /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023 > 21:40:54,220 WARN NaiveFileWALManager:205 - Already shutting down, but > getting another shutting down notice, odd > 21:40:54,221 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting > down > > I waited for a while, no retry. So killed both processes and started the > coll then the agen process. The lines from the source file were > delivered and this is what I see in the agen process� > > 21:52:43,397 INFO NaiveFileWALManager:760 - opening log file > 20110814-214053048+1200.1313314853048500000.00000023 > 21:52:43,605 INFO TextFileSource:70 - File > /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small > closed > 21:52:43,605 INFO RollSink:206 - closing RollSink 'ackingWal' > 21:52:43,608 INFO SeqfileEventSink:92 - closed > /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023 > 21:52:43,609 INFO AckListener$Empty:45 - Empty Ack Listener ended > 20110814-215243380+1200.1313315563380444000.00000023 > 21:52:43,609 INFO NaiveFileWALManager:474 - File lives in > /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023 > 21:52:43,610 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting > down > 21:52:43,952 INFO WALAckManager:92 - Ack for > 20110814-214053048+1200.1313314853048500000.00000023 is queued to be checked > 21:52:43,952 INFO WALSource:97 - end of file NaiveFileWALManager > (dir=/tmp/flume-aaron/agent/agen ) > 21:52:43,952 INFO NaiveFileWALManager:760 - opening log file > 20110814-215243380+1200.1313315563380444000.00000023 > 21:52:44,110 ERROR DirectDriver:311 - Expected IDLE but timed out in > state ACTIVE > 21:52:44,111 INFO NaiveFileWALDeco:185 - Closing wal log, subsink still > making progress > 21:52:44,316 INFO WALAckManager:92 - Ack for > 20110814-215243380+1200.1313315563380444000.00000023 is queued to be checked > 21:52:44,316 INFO WALSource:97 - end of file NaiveFileWALManager > (dir=/tmp/flume-aaron/agent/agen ) > 21:52:44,517 WARN NaiveFileWALManager:205 - Already shutting down, but > getting another shutting down notice, odd > 21:52:44,518 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting > down > 21:52:44,518 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed > 21:53:48,409 INFO WALAckManager:153 - Retransmitting > 20110814-215243380+1200.1313315563380444000.00000023 after being stale > for 64093ms > 21:53:48,409 INFO WALAckManager:153 - Retransmitting > 20110814-214053048+1200.1313314853048500000.00000023 after being stale > for 64457ms > 21:54:48,444 INFO WALAckManager:153 - Retransmitting > 20110814-215243380+1200.1313315563380444000.00000023 after being stale > for 60035ms > 21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that > happend with SENT vs SENDING states > 21:54:48,444 INFO WALAckManager:153 - Retransmitting > 20110814-214053048+1200.1313314853048500000.00000023 after being stale > for 60035ms > 21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that > happend with SENT vs SENDING states > 21:55:48,478 INFO WALAckManager:153 - Retransmitting > 20110814-215243380+1200.1313315563380444000.00000023 after being stale > for 60034ms > 21:55:48,479 WARN NaiveFileWALManager:789 - There was a race that > happend with SENT vs SENDING states > > I noticed nothing in the logs for coll. > > > My questions are: > > 1) Am I going about this test the right way ? > > 2) For the first case, should the agen process retry ? It sounds a > little like this https://issues.apache.org/jira/browse/FLUME-711 > > 3) Are the "NaiveFileWALManager shutting down" messages as bad as the > sound ? I'm guessing not because they are INFO. > > 4) Is there a timeout to set to avoid "ERROR DirectDriver:311 - Expected > ACTIVE but timed out in state OPENING" ? > > 5) "INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed" I > checked lsof with the processes still running and saw > 202-126-206-214:tmp aaron$ lsof -i -P | grep 35853 > java 44571 aaron 113u IPv6 0xffffff8017bec600 0t0 TCP *:35853 (LISTEN) > > 6) "WARN NaiveFileWALManager:789 - There was a race that happend with > SENT vs SENDING states" Is this a known issue ? Am I seeing this ? > https://issues.apache.org/jira/browse/FLUME-637 > > 7) "INFO WALAckManager:153 - Retransmitting > 20110814-214053048+1200.1313314853048500000.00000023 after being stale > for 60035ms" Went on as along as I left the agen process running, there > was nothing been displayed on the coll though. Is this also part of > FLUME-637 ? > > Thanks for any help. > ----------------- > Aaron Morton > Freelance Cassandra Developer > @aaronmorton > http://www.thelastpickle.com >