flume-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matthew Rathbone <matt...@foursquare.com>
Subject Re: Flume Reliability Issues
Date Fri, 02 Mar 2012 22:30:01 GMT
Hey Dennis, 

We've had a lot of issues with any flume version < 1.0. We lose a lot of data and we get
a lot of deadlocks.

Speaking to Cloudera I'd suggest you try out the flume-NG beta, labelled flume 1.0. It's a
total rewrite and it looks like there are people working on it full time. We're going to be
testing it out in the next few weeks. 

-- 
Matthew Rathbone
Foursquare | Software Engineer | Server Engineering Team
matthew@foursquare.com (mailto:matthew@foursquare.com) | @rathboma (http://twitter.com/rathboma)
| 4sq (http://foursquare.com/rathboma)



On Friday, March 2, 2012 at 4:35 AM, Meyer, Dennis wrote:

> Hi,
> 
> 
> 
> We encountered the following Issues in our development with Flume. We're investigating
the issues currently, but it would be great if someone could sent some feedback if this is

> Work as designed (but maybe misused)
> Known Issues (in that version only?)
> Not supported feature
> 
> 
> Here comes the list with the four issues we have seen: 
> 
> 
> Used Flume version 0.9.4+25.40-1 
> 1) Feature "Duplicate Data" works inconsistently 
> Not all data will be duplicated all the time (usage: send data to a SAN for a full backup
and send data to HDFS) 
> If a receiving node goes out of service, the sending node stops sending data to all receiving
nodes 
> Should a failed receiving node reconnect 
> The sending nodes CPU will go up to 100% usage, meaning that it will stop handling records
from now and as even if the CPU recovers, Flume does not 
> Even if the failed node reconnects, there is a chance that the sending source will not
notice the reconnect. This can only be fixed by a full restart of all involved sending/receiving
nodes
> 
> 
> 
> 
> 
> 
> 2) Flume is unable to recover failed/crashed/lost nodes reliably 
> Often failed nodes get back up, but are not integrated into the data flow anymore(i.e.
a source not knowing that its sink reconnected) 
> A node may be lost, but neither the master nor any connected node know about it 
> A failed node can only be reliably re-introduced into its flow if ALL nodes are restarted
manually! 
> 
> 
> 
> 3) Flume is unable to run the highest reliability mode for records crash free 
> If a node reconnects after a failure, there is a good chance that the master node crashes

> 
> 
> 
> 4) Loosing records on node failure 
> Flume sends up to one thousand records as a batch from source to sink. If the sink failes
on the first record, the other 999 records sometimes get lost. 
> On the highest reliability mode, Flume was unable to reroute records safely through.
As we send data to a node which is or goes out of service, Flume saves this data for later
when the node reconnects. What it should really do is take the events from the failed node
and reroute them accordingly to the defined flow into another node.
> 
> 
> 
> 
> 
> BIG THANKS!
> 
> Dennis
> 
> 



Mime
View raw message