From: Thomas Vachon <vachon@sessionm.com>
To: flume-user@incubator.apache.org
Subject: Unexplainable Bug in Flume Collectors
Date: Fri, 27 Jan 2012 10:29:50 -0500

I have 10 logical collectors per collector node: 2 for each log file I monitor, one with an HDFS sink and one with an S3 sink (the pairing is sketched after the log excerpt below). I recently went from 8 to 10. The 10th sink is failing 100% of the time. On a node I see:

2012-01-27 15:19:05,104 INFO = com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log = file = 20120127-142931487+0000.6301485683111842.00000106
2012-01-27 = 15:19:05,105 INFO com.cloudera.flume.handlers.debug.StubbornAppendSink: = append failed on event 'ip-10-212-145-75.ec2.internal [INFO Fri Jan 27 = 14:29:31 UTC 2012] { AckChecksum : (long)1327674571487  (string) = '5?:?' (double)6.559583946287E-312 } { AckTag : = 20120127-142931487+0000.6301485683111842.00000106 } { AckType : beg } ' = with error: Append failed java.net.SocketException: Broken = pipe
2012-01-27 15:19:05,105 INFO = com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on = port 35862 closed
2012-01-27 15:19:05,106 INFO = com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on = port 35862 closed
2012-01-27 15:19:05,108 INFO = com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to = admin.internal.sessionm.com:35862 opened
2012-01-27 = 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: = ThriftEventSink to ip-10-194-66-32.ec2.internal:35862 = opened
2012-01-27 15:19:05,108 INFO = com.cloudera.flume.handlers.debug.InsistentOpenDecorator: Opened = BackoffFailover on try 0
2012-01-27 15:19:05,109 INFO = com.cloudera.flume.agent.WALAckManager: Ack for = 20120127-142931487+0000.6301485683111842.00000106 is queued to be = checked
2012-01-27 15:19:05,109 INFO = com.cloudera.flume.agent.durability.WALSource: end of file = NaiveFileWALManager = (dir=3D/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 = )
2012-01-27 15:19:05,109 INFO = com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log = file = 20120127-143151597+0000.6301625792799805.00000106
2012-01-27 = 15:19:05,110 INFO com.cloudera.flume.agent.WALAckManager: Ack for = 20120127-143151597+0000.6301625792799805.00000106 is queued to be = checked
2012-01-27 15:19:05,110 INFO = com.cloudera.flume.agent.durability.WALSource: end of file = NaiveFileWALManager = (dir=3D/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 = )
2012-01-27 15:19:05,110 INFO = com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log = file = 20120127-151120751+0000.6303994947346458.00000351
2012-01-27 = 15:19:05,111 INFO com.cloudera.flume.agent.WALAckManager: Ack for = 20120127-151120751+0000.6303994947346458.00000351 is queued to be = checked
2012-01-27 15:19:05,111 INFO = com.cloudera.flume.agent.durability.WALSource: end of file = NaiveFileWALManager = (dir=3D/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 = )


On the collector, I see flow 9 (the HDFS sink flow for the same log file) working just fine. I see that it opens the s3n sink for the S3 flow, but no data is being ingested. On the node, all I see is the "Broken pipe" error above. I suspect that is the problem, but I am unable to find a way to fix it. I confirmed connectivity via telnet to the RPC source port.

I have exhausted every measure I can think of to fix this. I have unmapped, configured, and remapped every node (roughly the cycle sketched below) to make sure we did not have a bad mapping somewhere. The master shows no errors, and 9 out of the 10 flows are working just as they should.

Does anyone have an idea?

--
Thomas Vachon
Principal Operations Architect
session M
vachon@sessionm.com