flume-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Noel Duffy <noel.du...@sli-systems.com>
Subject RE: Architecting Flume for failover
Date Wed, 20 Feb 2013 07:27:09 GMT
That makes sense. Thanks for the detailed reply. And thanks to Hari, Jeff, and everyone else
who took time to answer.

From: Juhani Connolly [juhani_connolly@cyberagent.co.jp]
Sent: Wednesday, 20 February 2013 7:27 p.m.
To: user@flume.apache.org
Subject: Re: Architecting Flume for failover

Sink-groups are a concept local to a single agent. What is happening to
you is both your agents have the same sink group, and they're both
selecting the highest priority sink.

Agents are independent and communicate with each other via sink-source
couplings such as avro-sink to avro-source(or thrift).

Fail-over and load balancing in sink groups is intended to be a
mechanism that redirects traffic when a particular sink fails. So if say
you have two hdfs clusters, and one fails, you can reroute your data to
the secondary one until it recovers.

If you want to failover agents,  what you likely want to do to
guarrantee immediate delivery is just set up two agents that are both
directing the data to your hdfs cluster. Then when processing the data
filter out duplicates.

If this isn't workable, more complex methods might involve adding
sequence numbers to the data, and another flume agent with an
interceptor that filters out duplicates based on the sequence numbers.
This would be rather complex, and isn't something built into flume.

On 02/20/2013 03:07 PM, Noel Duffy wrote:
> I think we're talking at cross purposes here. First, let me re-state the problem at a
high level. Hopefully this will be clearer:
> I want to have two or more Flume agents running on different hosts reading events from
the same source (RabbitMQ) but only writing the event to the final sink (hdfs) once. Thus,
if someone kicks out a cable and one of the Flume hosts dies, the other(s) should take over
seamlessly without the loss of any events. I thought that failover and load-balancer sinkgroups
were created to solve this kind of problem, but the documentation only covers the case where
all the sinks in the sinkgroup are on one host, and this does not give me the redundancy I
> This is how I tried to solve the problem. I set up two hosts running Flume with the same
configuration. Thinking that a failover sinkgroup was the right approach to tackling my problem,
I created a sinkgroup and put two hdfs sinks in it, hdfsSink-1 and hdfsSink-2. The idea was
that hdfsSink-1 would be on Flume host A and hdfsSink-2 would be on Flume host B. Then, events
arrive on both host A and host B, and host A would write the event to hdfsSink-2, sending
it across the network to Flume Host B, while host B would write the same event to hdfsSink-2
which is local to it.  Both agents should, in theory, write the event to the same sink on
Flume Host B. Yes, I know that this would still duplicates the events, but I figured I would
worry about that later. However, this has not worked as I expected. Flume Host A does not
pass anything to Flume Host B.
> I need to know if the approach I've adopted, sinkgroups and failover, can achieve the
kind of multi-host redundancy I want. If they can, are there examples of such configurations
that I can see? If they cannot, what kind of configuration would be suitable for what I want
to achieve?
> From: Hari Shreedharan [mailto:hshreedharan@cloudera.com]
> Sent: Wednesday, 20 February 2013 5:39 p.m.
> To: user@flume.apache.org
> Subject: Re: Architecting Flume for failover
> Also, as Jeff said, sink-2 has a higher priority (the absolute value of the priority
being higher, that sink is picked up).
> --
> Hari Shreedharan
> On Tuesday, February 19, 2013 at 8:37 PM, Hari Shreedharan wrote:
> No, it does not mean that. To talk to different HDFS clusters you must specify the hdfs.path
as hdfs://namenode:port/<path>. You don't need to specify the bind etc.
> Hope this helps.
> Hari
> --
> Hari Shreedharan
> On Tuesday, February 19, 2013 at 8:18 PM, Noel Duffy wrote:
> Hari Shreedharan [mailto:hshreedharan@cloudera.com] wrote:
> The "bind" configuration param does not really exist for HDFS Sink (it is only for the
IPC sources).
> Does this mean that failover for sinks on different hosts cannot work for HDFS sinks
at all? Does it require Avro sinks, which seem to have a hostname parameter?

View raw message