flume-user mailing list archives

From Thomas Vachon <vac...@sessionm.com>
Subject Unexplainable Bug in Flume Collectors
Date Fri, 27 Jan 2012 15:29:50 GMT
I have 10 logical collectors per collector node: 2 for each log file I monitor (one with an HDFS sink, one with an S3 sink).  I recently went from 8 to 10, and the 10th sink is failing 100% of
the time.  (A rough sketch of the mappings follows the log excerpt below.)  On the agent node I see:

2012-01-27 15:19:05,104 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-142931487+0000.6301485683111842.00000106
2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.debug.StubbornAppendSink: append failed on event 'ip-10-212-145-75.ec2.internal [INFO Fri Jan 27 14:29:31 UTC 2012] { AckChecksum : (long)1327674571487  (string) '5?:?' (double)6.559583946287E-312 } { AckTag : 20120127-142931487+0000.6301485683111842.00000106 } { AckType : beg } ' with error: Append failed java.net.SocketException: Broken pipe
2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
2012-01-27 15:19:05,106 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to admin.internal.sessionm.com:35862 opened
2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to ip-10-194-66-32.ec2.internal:35862 opened
2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: Opened BackoffFailover on try 0
2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-142931487+0000.6301485683111842.00000106 is queued to be checked
2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-143151597+0000.6301625792799805.00000106
2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-143151597+0000.6301625792799805.00000106 is queued to be checked
2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-151120751+0000.6303994947346458.00000351
2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-151120751+0000.6303994947346458.00000351 is queued to be checked
2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
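
For reference, each monitored log has a pair of logical collectors configured roughly like the following (flume shell syntax from memory; the node names, flow ids, ports, and sink paths here are placeholders, not my real config):

  exec config coreSiteCollectorHDFS flowCoreHDFS 'collectorSource(35862)' 'collectorSink("hdfs://namenode/flume/coreSite/", "core-")'
  exec config coreSiteCollectorS3   flowCoreS3   'collectorSource(35863)' 'collectorSink("s3n://ACCESSKEY:SECRET@bucket/flume/coreSite/", "core-")'
  exec map ip-10-194-66-32.ec2.internal coreSiteCollectorHDFS
  exec map ip-10-194-66-32.ec2.internal coreSiteCollectorS3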


On the collector, I see flow 9 (the HDFS flow for the same log file) working just fine.
 I also see it open the s3n sink for the S3 flow, but no data is being ingested.  On the
agent node I see the "Broken pipe" errors shown above.  I suspect that is the problem, but I am unable
to find a way to fix it.  I confirmed connectivity via telnet to the RPC source port.
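(The check was just something along these lines, run from the agent host against both collectors that appear in the log above; both connect fine:

  telnet admin.internal.sessionm.com 35862
  telnet ip-10-194-66-32.ec2.internal 35862
)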

I have exhausted every measure I can think of to fix this.  I have unmapped, reconfigured, and remapped
every node to rule out any odd configuration issue.  The master shows no errors, and 9 out
of the 10 flows are working just as they should.
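
The unmap/reconfigure/remap cycle was done through the flume shell, roughly like this for each logical node (the flow id, tail path, and chain below are illustrative, not my exact config):

  exec unmap ip-10-212-145-75.ec2.internal coreSiteS3-i-654dd706
  exec config coreSiteS3-i-654dd706 flowCoreS3 'tail("/var/log/coreSite.log")' 'agentE2EChain("admin.internal.sessionm.com:35862","ip-10-194-66-32.ec2.internal:35862")'
  exec map ip-10-212-145-75.ec2.internal coreSiteS3-i-654dd706
  exec refresh coreSiteS3-i-654dd706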

Does anyone have an idea?

--
Thomas Vachon
Principal Operations Architect
session M
vachon@sessionm.com


