flume-user mailing list archives

From Thomas Vachon <vac...@sessionm.com>
Subject Re: Unexplainable Bug in Flume Collectors
Date Fri, 27 Jan 2012 15:56:32 GMT
I did a packet capture that I was just able to analyze.  I see the TCP handshake and some PSH,ACKs,
then the collector sends an RST.  This is not a pattern I see in any other flow on that box.
 It is also observed on my 2nd collector.  I have checked file limits/open descriptors; nothing
is above its limit.
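For reference, the descriptor check can be scripted roughly like this (a sketch: `$$`, this shell's own pid, stands in for the collector JVM's pid, and the port in the capture filter is 35862 from the logs below):

```shell
# Sketch: compare a process's open descriptors against its per-process
# limit on Linux. Substitute the Flume collector JVM's pid for FLUME_PID;
# $$ (this shell) is only a stand-in so the commands run as written.
FLUME_PID=$$

open_fds=$(ls "/proc/$FLUME_PID/fd" | wc -l)
echo "open fds: $open_fds"
grep 'Max open files' "/proc/$FLUME_PID/limits"

# To see which side sends the RST, capture only reset segments on the
# Thrift port (35862 in the logs). This needs root, so it is shown as a
# comment rather than executed:
#   tcpdump -n -i any 'tcp port 35862 and (tcp[tcpflags] & tcp-rst) != 0'
```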


On Jan 27, 2012, at 10:29 AM, Thomas Vachon wrote:

> I have 10 logical collectors per collector node: 2 for each log file I monitor (one for the HDFS sink, one for the S3 sink).  I recently went from 8 to 10.  The 10th sink is failing 100% of the time.  On a node I see:
> 
> 2012-01-27 15:19:05,104 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-142931487+0000.6301485683111842.00000106
> 2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.debug.StubbornAppendSink: append failed on event 'ip-10-212-145-75.ec2.internal [INFO Fri Jan 27 14:29:31 UTC 2012] { AckChecksum : (long)1327674571487  (string) '5?:?' (double)6.559583946287E-312 } { AckTag : 20120127-142931487+0000.6301485683111842.00000106 } { AckType : beg } ' with error: Append failed java.net.SocketException: Broken pipe
> 2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
> 2012-01-27 15:19:05,106 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
> 2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to admin.internal.sessionm.com:35862 opened
> 2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to ip-10-194-66-32.ec2.internal:35862 opened
> 2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: Opened BackoffFailover on try 0
> 2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-142931487+0000.6301485683111842.00000106 is queued to be checked
> 2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
> 2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-143151597+0000.6301625792799805.00000106
> 2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-143151597+0000.6301625792799805.00000106 is queued to be checked
> 2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
> 2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-151120751+0000.6303994947346458.00000351
> 2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-151120751+0000.6303994947346458.00000351 is queued to be checked
> 2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
> 
> 
> On the collector, I see flow 9 (the HDFS sink flow for the same log file) working just fine.  I see that it opens the s3n sink for the S3 flow, but no data is being ingested.  On the node I see "Broken pipe".  I suspect that is the problem, but I am unable to find a way to fix it.  I confirmed connectivity via telnet to the RPC source port.
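Note that a successful telnet only proves the handshake completes; the capture described above shows the RST arriving after the first writes, so a connect test alone cannot rule this out. A minimal sketch of the same probe against both collectors (hostnames and port taken from the logs; pure bash, no nc required):

```shell
# Sketch: probe the Thrift source port on each collector via bash's
# built-in /dev/tcp redirection. Hostnames and port 35862 come from the
# logs above; substitute your own. A completed handshake does not rule
# out a later RST on write.
for host in admin.internal.sessionm.com ip-10-194-66-32.ec2.internal; do
  if timeout 5 bash -c "exec 3<>/dev/tcp/$host/35862" 2>/dev/null; then
    echo "$host:35862 open"
  else
    echo "$host:35862 closed or unreachable"
  fi
done
```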
> 
> I have exhausted every measure I can think of to fix this.  I have unmapped, reconfigured, and remapped every node to rule out a mapping problem.  The master shows no errors, and 9 of the 10 flows are working just as they should.
> 
> Does anyone have an idea?
> 
> --
> Thomas Vachon
> Principal Operations Architect
> session M
> vachon@sessionm.com
> 
> 

