flume-dev mailing list archives

From Jonathan Hsieh <...@cloudera.com>
Subject Re: Flume master NG
Date Fri, 22 Jul 2011 14:13:33 GMT
I'm assuming the goal of inter-DC comms is replication and disaster
recovery.  Are there other goals for this?

On Mon, Jul 18, 2011 at 12:08 AM, Eric Sammer <esammer@cloudera.com> wrote:

> A good (and tricky) point. In keeping with the way we've talked about
> Flume, there are two systems that are important. One system is the
> control plane responsible for node health and status, configuration,
> and reporting. The other major system is the data plane, obviously
> responsible for data transfer. I see the utility of both being able to
> span a wide area but I think it warrants a discussion as to how it
> should work and what guarantees we're interested in making here.
> Major points to cover:
> * If we use ZK sessions and ephemeral nodes for heartbeats and status,
> are we killing ourselves long term? I know ZK can span wide areas
> (i.e. suffer high latency) but it will certainly impact the time it
> takes to detect and recover from errors. I'd like to get input from
> phunt or Henry on this. Another option is to have a secondary layer
> for controlling inter-DC communication and have discreet ZK and
> control planes.

I'd prefer limiting the scope of implementation currently, but having a
design that will allow for greater scope in the future.  More specifically,
I've been thinking that the current flume, with its collectors and agents
would be focused on the single DC case.

Let's consider the "limits" with ZK and ephemeral nodes today.  Heartbeating
via ephemeral nodes will crap out when ZK cannot handle the # of connections
or the amount of writes.  Last I've heard this is on the order of 1k
writes/s on a laptop [1] and several 10k's from the ZK paper [2].  With 1
second ephemeral node refreshes I think this means we could theoretically
have 10k's of nodes.  More practically, this means it should be able to
handle 1k's of nodes with failures.  Right now we use 5s, so we could get a
5x bump or more if we make the heartbeats longer.  My understanding is that
there are single-DC clusters today that are already at these scales.
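To make the arithmetic explicit, here is a back-of-envelope sketch; the
writes/s and heartbeat-interval figures are the rough numbers cited above,
not measurements:

```python
# Rough capacity estimate for ZK ephemeral-node heartbeats.
# Assumption: each live node costs one ZK write per refresh interval.

def max_nodes(zk_writes_per_sec, heartbeat_interval_sec):
    """Nodes supportable if every heartbeat refresh is one ZK write."""
    return int(zk_writes_per_sec * heartbeat_interval_sec)

print(max_nodes(1_000, 1))   # 1s refresh, laptop-scale ZK  -> 1000 nodes
print(max_nodes(1_000, 5))   # 5s refresh (current setting) -> 5000 nodes
print(max_nodes(10_000, 5))  # ZK-paper-scale throughput    -> 50000 nodes
```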

It follows that if we ever get into clusters with 10k's or 100k's of
machines we'll probably want to have multiple instances.  If they are in the
same DC they could potentially write to the same HDFS.  If they are in
different DCs we have a different problem -- single source and many
destinations.  For this I'm leaning towards a secondary layer of Flume, with
inter-DC agents and inter-DC collectors, which would likely have more of a
pub-sub model a la Facebook's Calligraphus (uses HDFS for durability) or
Yahoo!'s Hedwig (BookKeeper/ZK for durability).  This lets the Flume 1.0
goal be solid and scalable many-to-few aggregation within a single DC, with
a long-term Flume 2.0 goal of few-to-many inter-DC replication.

 [2] http://www.usenix.org/event/atc10/tech/full_papers/Hunt.pdf

> * The new role of the master. Assuming we can gut the ACK path from the
> master and remove it from the critical path, does it matter where it
> lives? Based on the discussions we've had internally and what is
> codified in the doc thus far, the master would only be used to push
> config changes and display status. Really, it just reads from / writes
> to ZK. ZK is the true master in the proposal.

The master would also still be the server for the shell as well as the web
interfaces.  A shell could potentially just go to ZK directly, but my first
intuition is that there should be one server, to make concurrent shell
access easier to understand and manage.

By decoupling the state into ZK, this functionality could be separated.  For
now, the master would remain the "ack node", though that could also
eventually be put into a separate process.  Or, collector nodes could
actually publish acks into ZK.  At the end of the day, if all the state
lives in ZK, the master doesn't even need to talk to nodes -- it could just
talk to ZK.  The master could watch the status and manage control plane
things like flow transitions and adapting as nodes are added or removed.  If
masters only interact with ZK, I don't think we would even need master
leader election.
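A toy sketch of the "ZK is the true master" idea, with an in-memory map
standing in for ZK znodes; the paths and names here are hypothetical, not a
proposed schema:

```python
# Masters and nodes share only a key/value store (stand-in for ZK).
# Because pushing a config is just a write to ZK, any master instance
# can do it -- which is why leader election may become unnecessary.

class FakeZk:
    def __init__(self):
        self.znodes = {}
    def set(self, path, data):
        self.znodes[path] = data
    def get(self, path):
        return self.znodes.get(path)

def master_push_config(zk, node, config):
    """Any master can push config; the ZK write is the only side effect."""
    zk.set(f"/flume/nodes/{node}/config", config)

def node_poll_config(zk, node):
    """Nodes read (or in real ZK, watch) their own config znode."""
    return zk.get(f"/flume/nodes/{node}/config")

zk = FakeZk()
master_push_config(zk, "agent-1", "tail | agentSink")
print(node_poll_config(zk, "agent-1"))  # tail | agentSink
```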

Another ramification of completely decoupling for the inter-DC case is that
one master could consult two or more different ZKs.

> * If we're primarily interested in an inter-DC data plane, is this
> best done by having discrete Flume systems and simply configuring them
> to hand off data to one another? Note this is how most systems work:
> SMTP, modular switches with "virtual fabric" type functionality, JMS
> broker topologies. Most of the time, there's a discrete system in each
> location and an "uber tool" pushes configurations down to each system
> from a centralized location. Nothing we would do here would preclude
> this from working. In other words, maybe the best way to deal with
> inter-DC data is to have separate Flume installs and have a tool or UI
> that just communicates to N masters and pushes the right config to
> each.
I like the idea of a messaging/pub-sub style system for inter-DC
communications.  I like the design of Facebook's Calligraphus system (treat
HDFS as a pub-sub queue using append and multiple concurrent readers for
3-5 second data latency).
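A minimal illustration of the append-as-pub-sub idea; io.BytesIO stands in
for an HDFS file here, and the record names are made up:

```python
import io

# One writer appends records; each subscriber remembers its own byte
# offset and re-reads from there. This is the essence of treating an
# append-only file as a pub-sub queue with multiple concurrent readers.

log = io.BytesIO()

def publish(record):
    log.seek(0, io.SEEK_END)
    log.write(record.encode() + b"\n")

def consume(offset):
    """Return (new_records, new_offset) for a subscriber at `offset`."""
    log.seek(offset)
    data = log.read()
    return data.decode().splitlines(), offset + len(data)

publish("event-1")
records, off = consume(0)
publish("event-2")
more, off = consume(off)
print(records, more)  # ['event-1'] ['event-2']
```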

We don't necessarily have to solve all of these problems / answer all
> of these questions. I'm interested in deciding:
> * What an initial re-implementation or re-factoring of the master looks
> like.

I think the discrete steps may be:

1) nodes use ZK ephemeral nodes to heartbeat
2) translation/auto-chaining stuff gets extracted out of the master (just a
client to the ZK "schema")
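Step 1 could look roughly like this in miniature. This is a simulation of
ephemeral-session expiry, not real ZK client code; the timeout value and
node names are hypothetical:

```python
# Ephemeral-node liveness in miniature: in real ZK, the server deletes a
# node's ephemeral znode when its session times out; here a timestamp map
# stands in for the session tracking.

class EphemeralRegistry:
    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Refreshing the ephemeral znode == one heartbeat write."""
        self.last_seen[node] = now

    def live_nodes(self, now):
        """Nodes whose session has not expired are considered healthy."""
        return {n for n, t in self.last_seen.items()
                if now - t <= self.session_timeout}

reg = EphemeralRegistry(session_timeout=5.0)
reg.heartbeat("agent-1", now=0.0)
reg.heartbeat("agent-2", now=0.0)
reg.heartbeat("agent-1", now=4.0)
print(reg.live_nodes(now=7.0))  # {'agent-1'} -- agent-2's session expired
```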

> * What features we'd like to support long term.

Ideally, the ZK "schema" for Flume data would be extensible so that extra
features (like chokes, the addition of flows, the addition of IP/port info,
etc.) can be added without causing pain.
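For illustration only, here is a hypothetical znode layout along those
lines. None of these paths are decided; the point is just that new subtrees
can be added later without breaking existing readers:

```
/flume
  /nodes
    /agent-1
      /heartbeat    (ephemeral; existence == liveness)
      /config       (current source/sink spec, pushed by a master)
      /status       (last state reported by the node)
  /flows
    /flow-A         (flows added later in their own subtree)
  /chokes           (future: throttling config, isolated from /nodes)
```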

* What features we want to punt on for the foreseeable future (e.g. > 24
> months)
> On Sat, Jul 16, 2011 at 8:10 PM, NerdyNick <nerdynick@gmail.com> wrote:
> > The only thing I would like to make sure is that the design allow for
> > future work to be done on allowing for multi datacenter.
> >
> > On Fri, Jul 8, 2011 at 3:08 PM, Eric Sammer <esammer@cloudera.com>
> wrote:
> >> Flumers:
> >>
> >> Jon and I had some preliminary discussions (prior to Flume entering
> >> the incubator) about a redesign of flume's master. The short version
> >> is that:
> >>
> >> 1. The current multimaster doesn't work correctly in all cases (e.g.
> >> autochains).
> >> 2. State exists in the master's memory that isn't in ZK (so failover
> >> isn't simple)
> >> 3. All the heartbeat / RPC code is complicated.
> >>
> >> The short version of master NG would be:
> >>
> >> 1. All master state gets pushed into ZK
> >> 2. Nodes no longer heartbeat to a specific master. Instead, they
> >> connect to ZK and use ephemeral nodes to indicate health.
> >> 3. Masters judge health and push configuration by poking state into
> >> each node's respective ZK namespace.
> >>
> >> There's a separate discussion around changing how ACKs work (so they
> >> don't pass through the master) but we're going to separate that for
> >> now. This is just a heads up that I'm going to start poking bits into
> >> the new Apache Flume wiki. The discussion is open to all and both Jon
> >> and I would love to hear from users, contrib'ers, and dev'ers alike.
> >> I'm also going to try and rope in a ZK ninja to talk about potential
> >> issues they may cause / things to be aware of once we have something
> >> in the wiki to point at.
> >>
> >> --
> >> Eric Sammer
> >> twitter: esammer
> >> data: www.cloudera.com
> >>
> >
> >
> >
> > --
> > Nick Verbeck - NerdyNick
> > ----------------------------------------------------
> > NerdyNick.com
> > Coloco.ubuntu-rocks.org
> >
> --
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com

// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com
