From: Thomas Vachon <vachon@sessionm.com>
To: flume-user@incubator.apache.org
Subject: Unexplainable Bug in Flume Collectors
Date: Fri, 27 Jan 2012 10:29:50 -0500

I have 10 logical collectors per collector node: 2 for each log file I monitor, one with an HDFS sink and one with an S3 sink (the pairing is sketched after the log excerpt below). I recently went from 8 to 10. The 10th sink is failing 100% of the time. On a node I see:

2012-01-27 15:19:05,104 INFO = com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log = file = 20120127-142931487+0000.6301485683111842.00000106
2012-01-27 = 15:19:05,105 INFO com.cloudera.flume.handlers.debug.StubbornAppendSink: = append failed on event 'ip-10-212-145-75.ec2.internal [INFO Fri Jan 27 = 14:29:31 UTC 2012] { AckChecksum : (long)1327674571487  (string) = '5?:?' (double)6.559583946287E-312 } { AckTag : = 20120127-142931487+0000.6301485683111842.00000106 } { AckType : beg } ' = with error: Append failed java.net.SocketException: Broken = pipe
2012-01-27 15:19:05,105 INFO = com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on = port 35862 closed
2012-01-27 15:19:05,106 INFO = com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on = port 35862 closed
2012-01-27 15:19:05,108 INFO = com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to = admin.internal.sessionm.com:35862 opened
2012-01-27 = 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: = ThriftEventSink to ip-10-194-66-32.ec2.internal:35862 = opened
2012-01-27 15:19:05,108 INFO = com.cloudera.flume.handlers.debug.InsistentOpenDecorator: Opened = BackoffFailover on try 0
2012-01-27 15:19:05,109 INFO = com.cloudera.flume.agent.WALAckManager: Ack for = 20120127-142931487+0000.6301485683111842.00000106 is queued to be = checked
2012-01-27 15:19:05,109 INFO = com.cloudera.flume.agent.durability.WALSource: end of file = NaiveFileWALManager = (dir=3D/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 = )
2012-01-27 15:19:05,109 INFO = com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log = file = 20120127-143151597+0000.6301625792799805.00000106
2012-01-27 = 15:19:05,110 INFO com.cloudera.flume.agent.WALAckManager: Ack for = 20120127-143151597+0000.6301625792799805.00000106 is queued to be = checked
2012-01-27 15:19:05,110 INFO = com.cloudera.flume.agent.durability.WALSource: end of file = NaiveFileWALManager = (dir=3D/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 = )
2012-01-27 15:19:05,110 INFO = com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log = file = 20120127-151120751+0000.6303994947346458.00000351
2012-01-27 = 15:19:05,111 INFO com.cloudera.flume.agent.WALAckManager: Ack for = 20120127-151120751+0000.6303994947346458.00000351 is queued to be = checked
2012-01-27 15:19:05,111 INFO = com.cloudera.flume.agent.durability.WALSource: end of file = NaiveFileWALManager = (dir=3D/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 = )


On the collector, I see flow 9 (the HDFS sink flow for the same log file) working just fine. I see that it opens the s3n sink for the S3 flow, but no data is being ingested. On the node, all I see is the "Broken pipe" error above. I suspect that is the problem, but I am unable to find a way to fix it. I confirmed connectivity via telnet to the RPC source port.

I have exhausted every measure I can think of to fix this. I have unmapped, configured, and remapped every node (roughly the cycle sketched below) to make sure we did not have a bad mapping somewhere. The master shows no errors, and 9 out of the 10 flows are working just as they should.

Does anyone have an idea?

--
Thomas Vachon
Principal Operations Architect
session M
vachon@sessionm.com