Return-Path: X-Original-To: apmail-flume-user-archive@www.apache.org Delivered-To: apmail-flume-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 6A1749B0E for ; Thu, 23 Aug 2012 22:55:32 +0000 (UTC) Received: (qmail 94080 invoked by uid 500); 23 Aug 2012 22:55:32 -0000 Delivered-To: apmail-flume-user-archive@flume.apache.org Received: (qmail 94036 invoked by uid 500); 23 Aug 2012 22:55:32 -0000 Mailing-List: contact user-help@flume.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@flume.apache.org Delivered-To: mailing list user@flume.apache.org Received: (qmail 94028 invoked by uid 99); 23 Aug 2012 22:55:32 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 23 Aug 2012 22:55:32 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of bhaskarvk@gmail.com designates 209.85.216.174 as permitted sender) Received: from [209.85.216.174] (HELO mail-qc0-f174.google.com) (209.85.216.174) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 23 Aug 2012 22:55:24 +0000 Received: by qcro28 with SMTP id o28so853523qcr.19 for ; Thu, 23 Aug 2012 15:55:03 -0700 (PDT) Received: by 10.224.185.70 with SMTP id cn6mr5661532qab.16.1345762503073; Thu, 23 Aug 2012 15:55:03 -0700 (PDT) Received: from mail-vc0-f179.google.com (mail-vc0-f179.google.com [209.85.220.179]) by mx.google.com with ESMTPS id a13sm6158318qad.18.2012.08.23.15.55.02 (version=TLSv1/SSLv3 cipher=OTHER); Thu, 23 Aug 2012 15:55:02 -0700 (PDT) Received: by vcqp16 with SMTP id p16so1423828vcq.38 for ; Thu, 23 Aug 2012 15:55:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=F0IoIBGVsCgovxpjS8/mEwT5Pop4TzyRXWpqkXsaOlg=; b=bU+mCYlK2k7+tipFIRaGAofBw4WhTLCMmnceANmHod34fmDKb6cPVVsK8pziqaSrGW /Yf2VGjW4SkPPjhG45EGi3mHE+x7UuxrjSFsvwZRl/zREZ0waNSyh9AgQL097cS9Dh/F w2bbpSVJx1X4PuONrGETz6r0DonmhXxefKxsSEUGyBO7ou6dmBlekIw7JIQk3ZozIJbZ HDbc2U88YQkPCu8eA8sjK3ZO4alXabHe7NFX0i0foIFZy7SV+Ohclcq80c4TTmZmOddk y1cKtOzYbVstuvOzC4OLcULjdpottIh7ZZkYdTxf4dZNBp8uo6uOzcaGTM2LQ/viQdIJ NEGA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:x-gm-message-state; bh=F0IoIBGVsCgovxpjS8/mEwT5Pop4TzyRXWpqkXsaOlg=; b=KNj9fCrD+oYVOtLb06xi6Hyji2ddqUacNkXsxVEmxMtCq/lUCeuZyaWVdBlxMmc7D2 CVA91NqNwzWQJj5XOkmKl5ORRRB4Osua634chL1vzLIAY0X9IR5h4FVsv6uPvexhsnH7 mxYC8yMz+dig99eewRKgP4DPU0gUSLS9JGDWRyLeJvNVjJjosGFeDf/gBYCs//OVrLtK Vy3w4ZeI0eOb/7FjJnG7Z0nyRrCEmvhj4/AVZheq+qUc8k5BHHRXn7C4zuBcScpwJHqU X5hsPwaA6ohyWu3EpSDuouiSjKQOamI75M3w8rjjkZkLz0Z1uKpAFFIzCbkB+MjWG/Dm Z4cw== MIME-Version: 1.0 Received: by 10.52.72.79 with SMTP id b15mr2435382vdv.13.1345762502130; Thu, 23 Aug 2012 15:55:02 -0700 (PDT) Received: by 10.220.142.73 with HTTP; Thu, 23 Aug 2012 15:55:02 -0700 (PDT) In-Reply-To: <5DE0B0062D5B4F958C2EC2BF9A83CBA3@cloudera.com> References: <01C76B6F725F4689BD94E8BE4E335D72@cloudera.com> <7B848D03C7F14DD582140F48BAF9675D@cloudera.com> <5DE0B0062D5B4F958C2EC2BF9A83CBA3@cloudera.com> Date: Thu, 23 Aug 2012 18:55:02 -0400 Message-ID: Subject: Re: Proper documentation for setting up sink groups From: "Bhaskar V. Karambelkar" To: user@flume.apache.org Content-Type: multipart/alternative; boundary=20cf307f362ee43a5604c7f6bf2e X-Gm-Message-State: ALoCoQlOnBoNkyY9M8qYUt3bn2l3B6gj0Qb00pkcXCbB3yDuXEPVq1BxnCjGznKB6EHWjeVD6tkU --20cf307f362ee43a5604c7f6bf2e Content-Type: text/plain; charset=UTF-8 Some really insightful explanations Hari, thanks for the insight. Btw, I do feel all this should be in flume user guide for the greater good of mankind :) On Thu, Aug 23, 2012 at 6:45 PM, Hari Shreedharan wrote: > Please see inline. > > -- > Hari Shreedharan > > > On Thursday, August 23, 2012 at 3:28 PM, Bhaskar V. Karambelkar wrote: > > > My replies in line. and thanks for the detailed explanations. > > > > On Thu, Aug 23, 2012 at 2:57 PM, Hari Shreedharan < > hshreedharan@cloudera.com (mailto:hshreedharan@cloudera.com)> wrote: > > > > > > Please see inline. > > > > > > -- > > > Hari Shreedharan > > > > > > > > > On Thursday, August 23, 2012 at 11:43 AM, Bhaskar V. Karambelkar wrote: > > > > > > > Hi Hari, > > > > Yes I did read the whole guide end to end. > > > > But I still have doubts > > > > > > > > The fact that multiple sinks can feed from the same channel is news > to me. I don't see it explicitly mentioned in the docs, > > > > so i guess I assumed wrongly, that only one sink can feed from a > channel. > > > > > > > > a)Can you explain in detail , how having multiple sinks taking > events from one channel, is useful in a "fast source slow sinks" scenario ? > > > When multiple sinks read events from the same channel, you essentially > have as many threads taking events out, since each sink has at least one > thread. So if your source is dumping n events per second into the channel, > and your sink can only process 1 event per second, you could have n sinks > to read n events per second (this is hypothetical - your hardware and your > OS will restrict performance when the number of threads starts growing a > lot). A channel returns an event only once, how many ever sinks are taking > from the channel. Each event if removed and committed will never be given > to another sink. If there is a rollback, it is just like the event was > never taken, and a different sink will be able to take and commit it. > > > > > > > > > OK this makes sense. > > > > > > > > > > b) Also if I read your explanation below correctly there are 3 > possible cases > > > > > > > > 1) multiple sinks feeding from a single channel , with the default > sink processor this will be like a multiplexing channel with all sinks > getting all the events that come in the channel. > > > No, every time a take() is called from the channel, the channel will > return that event only to one sink. So each sink will get a unique > event(unless rollbacks happen - in which case the channel will put the > events back into the channel and a different sink might be able to pick it > up). > > > > > > > > > So this situation is exactly like a load balancing one, as events are > somewhat equally distributed between all sinks ? > Not necessarily equally distributed. Sinks poll the channel to take the > event. If a sink is slow in polling channels then it will get fewer events, > and if a channel is faster then that will get more events, since they are > running on different threads. > > > > > > > > > > 2) multiple sinks feeding from a single channel , with fail_over > sink processor, only one sink will get the events at a give time, with > flume failing over to next available sink in case the first one fails ? > > > A sink group essentially treats n sinks like one, and depending on the > criteria, will select one sink to process the next event from the channel. > In case of failover, sinks are picked in order of priority - and when one > sink fails, the next one is picked. > > > > > > > > > OK this makes sense. > > > > > > > > > > 3) multiple sinks feeding from a single channel, with load balancing > processor, with all sinks getting events in a round-robin/random order. > > > No, each sink will get a different event. One sink processes one event > and the next one picked will process the next event from the channel. > > > > > > Yes that's exactly what I meant, I didn't imply that all sinks get all > events, but the events are distributed more or less equally among the sinks > in round-robin/random order. > > As I said about this looks almost like #1, except here you have a > control over the selection algorithm (round-robin/random) > > Not just that you have control, this will not depend on the sink's > performance because all sinks are run from the same thread. So slower sinks > can slow down the whole process since only one sink reads from the channel > at any point in time. Think of a load balancing sink selector as a loop > which picks up one sink and passes the event to that one. Since there is > only one thread per sink group, having one sink group is often slower than > having multiple sinks reading from the same channel. > > > > > > Is this a correct assumption ? I am aware of #2 and #3, not sure > about #1. > > > > > > > > On Thu, Aug 23, 2012 at 12:43 PM, Hari Shreedharan < > hshreedharan@cloudera.com (mailto:hshreedharan@cloudera.com) (mailto: > hshreedharan@cloudera.com)> wrote: > > > > > Did you read this: > http://flume.apache.org/FlumeUserGuide.html#flume-sink-processors > > > > > > > > > > That explains how to use sink groups. Also there is nothing wrong > with multiple sinks taking events from one channel. This is an especially > useful configuration if you have a very fast source and much slower sinks. > > > > > > > > > > > > > > > Hari > > > > > > > > > > -- > > > > > Hari Shreedharan > > > > > > > > > > > > > > > On Thursday, August 23, 2012 at 9:28 AM, Bhaskar V. Karambelkar > wrote: > > > > > > > > > > > The sink group document doesn't mention anything about how > > > > > > to hook up sink groups to the rest of the config in order to > work. > > > > > > > > > > > > e.g. under normal circumstances one channel is linked with one > sink. > > > > > > > > > > > > But for failover sink group , looks like both the sinks should > be hooked up to the same channel, > > > > > > but this is not mentioned any where. > > > > > > > > > > > > Similarly, what exactly needs to be done for load balancing sink > ? > > > > > > > > > > > > thanks > > > --20cf307f362ee43a5604c7f6bf2e Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Some really insightful explanations Hari, thanks for the insight.
Btw, I do feel all this should be in flume user guide for the greater= good of mankind :)



On Thu, Aug 23, 2012 at 6:45 PM, Hari Shreedharan <hshreedharan@cl= oudera.com> wrote:
Please see inline.

--
Hari Shreedharan


On Thursday, August 23, 2012 at 3:28 PM, Bhaskar V.= Karambelkar wrote:

> My replies in line. and thanks for the detailed explanations.
>
> On Thu, Aug 23, 2012 at 2:57 PM, Hari Shreedha= ran <hshreedharan@cloudera.= com (mailto:hshreedharan@c= loudera.com)> wrote:
> >
> > Please see inline.
> >
> > --
> > Hari Shreedharan
> >
> >
> > On Thursday, August 23, 2012 at 11:43 AM, Bhaskar V. Karambelkar = wrote:
> >
> > > Hi Hari,
> > > Yes I did read the whole guide end to end.
> > > But I still have doubts
> > >
> > > The fact that multiple sinks can feed from the same channel = is news to me. I don't see it explicitly mentioned in the docs,
> > > so i guess I assumed wrongly, that only one sink can feed fr= om a channel.
> > >
> > > a)Can you explain in detail , how having multiple sinks taki= ng events from one channel, is useful in a "fast source slow sinks&quo= t; scenario ?
> > When multiple sinks read events from the same channel, you essent= ially have as many threads taking events out, since each sink has at least = one thread. So if your source is dumping n events per second into the chann= el, and your sink can only process 1 event per second, you could have n sin= ks to read n events per second (this is hypothetical - your hardware and yo= ur OS will restrict performance when the number of threads starts growing a= lot). A channel returns an event only once, how many ever sinks are taking= from the channel. Each event if removed and committed will never be given = to another sink. If there is a rollback, it is just like the event was neve= r taken, and a different sink will be able to take and commit it.
> >
>
>
> OK this makes sense.
>
> > >
> > > b) Also if I read your explanation below correctly there are= 3 possible cases
> > >
> > > 1) multiple sinks feeding from a single channel , with the d= efault sink processor this will be like a multiplexing channel with all sin= ks getting all the events that come in the channel.
> > No, every time a take() is called from the channel, the channel w= ill return that event only to one sink. So each sink will get a unique even= t(unless rollbacks happen - in which case the channel will put the events b= ack into the channel and a different sink might be able to pick it up).
> >
>
>
> So this situation is exactly like a load balancing one, as events are = somewhat equally distributed between all sinks ?
Not necessarily equally distributed. Sinks poll the channel to take t= he event. If a sink is slow in polling channels then it will get fewer even= ts, and if a channel is faster then that will get more events, since they a= re running on different threads.
>
> > >
> > > 2) multiple sinks feeding from a single channel , with fail_= over sink processor, only one sink will get the events at a give time, with= flume failing over to next available sink in case the first one fails ? > > A sink group essentially treats n sinks like one, and depending o= n the criteria, will select one sink to process the next event from the cha= nnel. In case of failover, sinks are picked in order of priority - and when= one sink fails, the next one is picked.
> >
>
>
> OK this makes sense.
>
> > >
> > > 3) multiple sinks feeding from a single channel, with load b= alancing processor, with all sinks getting events in a round-robin/random o= rder.
> > No, each sink will get a different event. One sink processes one = event and the next one picked will process the next event from the channel.=
>
>
> Yes that's exactly what I meant, I didn't imply that all sinks= get all events, but the events are distributed more or less equally among = the sinks in round-robin/random order.
> As I said about this looks almost like #1, except here you have a cont= rol over the selection algorithm (round-robin/random)

Not just that you have control, this will not depend on the sink'= s performance because all sinks are run from the same thread. So slower sin= ks can slow down the whole process since only one sink reads from the chann= el at any point in time. Think of a load balancing sink selector as a loop = which picks up one sink and passes the event to that one. Since there is on= ly one thread per sink group, having one sink group is often slower than ha= ving multiple sinks reading from the same channel.
>
> > > Is this a correct assumption ? I am aware of #2 and #3, not = sure about #1.
> > >
> > > On Thu, Aug 23= , 2012 at 12:43 PM, Hari Shreedharan <hshreedharan@cloudera.com (mailto:hshreedharan@cloudera.com) (mailto:hshreedharan@cloudera.com)> wrote:
> > > > Did you read this: http://flume.ap= ache.org/FlumeUserGuide.html#flume-sink-processors
> > > >
> > > > That explains how to use sink groups. Also there is not= hing wrong with multiple sinks taking events from one channel. This is an e= specially useful configuration if you have a very fast source and much slow= er sinks.
> > > >
> > > >
> > > > Hari
> > > >
> > > > --
> > > > Hari Shreedharan
> > > >
> > > >
> > > > On Thursday, August 23, 2012 at 9:28 AM, Bhaskar V. Kar= ambelkar wrote:
> > > >
> > > > > The sink group document doesn't mention anythi= ng about how
> > > > > to hook up sink groups to the rest of the config i= n order to work.
> > > > >
> > > > > e.g. under normal circumstances one channel is lin= ked with one sink.
> > > > >
> > > > > But for failover sink group , looks like both the = sinks should be hooked up to the same channel,
> > > > > but this is not mentioned any where.
> > > > >
> > > > > Similarly, what exactly needs to be done for load = balancing sink ?
> > > > >
> > > > > thanks



--20cf307f362ee43a5604c7f6bf2e--