Subject: Re: seeking help on flume cluster deployment
From: Joao Salcedo <joao.salcedo@gmail.com>
To: user@flume.apache.org
Date: Fri, 10 Jan 2014 15:58:13 +1100

Hi Chen,

Maybe it would be worth checking this:
http://flume.apache.org/FlumeDeveloperGuide.html#loadbalancing-rpc-client
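For a concrete starting point, here is a minimal, untested sketch of a client using it; the host names and the port are placeholders, not your actual agents:

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class LoadBalancedSender {
    public static void main(String[] args) throws EventDeliveryException {
        Properties props = new Properties();
        // Spread appends across two agents that each run an Avro source.
        props.setProperty("client.type", "default_loadbalance");
        props.setProperty("hosts", "h1 h2");
        props.setProperty("hosts.h1", "agent1.example.com:41414"); // placeholder host
        props.setProperty("hosts.h2", "agent2.example.com:41414"); // placeholder host
        props.setProperty("host-selector", "round_robin");
        props.setProperty("backoff", "true"); // temporarily skip a host that fails

        RpcClient client = RpcClientFactory.getInstance(props);
        try {
            Event event = EventBuilder.withBody("hello flume", StandardCharsets.UTF_8);
            client.append(event); // retried on the other host if one is down
        } finally {
            client.close();
        }
    }
}

If one agent goes down, the client fails over to the remaining one, which covers the "one node dies, another picks up" requirement on the delivery path.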

Regards,

Joao


On Fri, Jan 10, 20= 14 at 3:50 PM, Jeff Lord <jlord@cloudera.com> wrote:
Have you taken a look at th= e load balancing rpc client?

On Thu, Jan 9, 2014 at 8:43 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:

Jeff,
I read this presentation at the beginning but didn't find a solution to my use case. To simplify: I have only one data source (composed of 5 socket servers), and I am looking for a fault-tolerant Flume deployment that reads from this single source and sinks to HDFS, so that when one node dies another Flume node can pick up and continue.
Thanks,
Chen


On Thu, Jan 9, 2014 at 7:49 PM, Jeff Lo= rd <jlord@cloudera.com> wrote:
Chen,

Ha= ve you taken a look at this presentation on Planning and Deploying Flume fr= om ApacheCon?


It may have the answers you need.

<= /div>
Best,

Jeff


On Thu, Jan 9, 2014 at 7:24 PM, Chen Wang <chen.apache.solr@gmail= .com> wrote:

Thanks Saurabh.
If that is the case, I am actually thinking about using a Storm spout to talk to our socket servers, so that the Storm cluster takes care of the socket-reading part. Then, on each Storm node, start a Flume agent that listens on an RPC port and writes to HDFS (with failover). The Storm bolt would simply send the data over RPC so that Flume can get it.
What do you think of this setup? It takes care of failover on both the source (by Storm) and the sink (by Flume), but it looks a little complicated to me.
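For what it's worth, I picture the per-node agent config looking roughly like this; the port and paths are placeholders and I haven't tested it:

# Agent on each Storm node: Avro RPC in, HDFS out.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Avro source that the Storm bolt sends events to over RPC.
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.channels = c1

# Durable file channel so an agent crash does not lose buffered events.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

The bolt side would then just build a client with RpcClientFactory.getDefaultInstance("localhost", 41414) and call client.append(...), as in the Developer Guide linked above.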
Chen


On Thu, Jan 9, 2014 at 7:18 PM, Saurabh B <qna.list.141211@gmail.c= om> wrote:
Hi Chen,

I think Flume doesn't have= a way to configure multiple sources pointing to same data source. Of cours= e you can do that, but you will end up with duplicate data. Flume offers fa= il over at the sink level.
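To make that concrete, sink-level failover is configured with a sink group. A rough, untested sketch (the sink names are placeholders; k1 and k2 would be defined elsewhere in the same file):

# Two sinks in a failover group: k2 takes over while k1 is down.
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

The higher-priority sink is used as long as it is healthy; maxpenalty caps, in milliseconds, how long a failed sink is penalized before it is retried.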

On Thu, Jan 9, 2014 at 6:56 PM, Chen Wang <chen.apache.s= olr@gmail.com> wrote:
Ok. so after more researchi= ng:) It seems that what i need is the failover for agent source, (not fail = over for sink):
If one agent dies, another same kind of agent will start running.
Does flume support this scenario?
Thanks,
Chen=A0


On Thu, Jan 9, 2014 at 3:12 PM, Chen= Wang <chen.apache.solr@gmail.com> wrote:
After reading more docs, it= seems that if I want to achieve my goal, i have to do the following:
1= . Having one agent with the custom source running on one node. This agent r= eads from those 5 socket server, and sink to some kind of sink(maybe anothe= r socket?)
2. On another(or more) machines, setting up collectors that read from = the agent sink in 1, and sink to hdfs.
3. Having a master node ma= naging nodes in 1,2.

But it seems to be overskille= d in my case: in 1, i can already sink to hdfs. Since the data available at= socket server are much faster than the data translation part. =A0I want to= be able to later add more nodes to do the translation job. so what is the = correct setup?
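Roughly, assuming placeholder host names and a hypothetical class name for my custom source (untested), I picture the two tiers wired with an Avro hop like this:

# Tier 1 (agent node): custom source -> Avro sink to a collector.
agent.sources = s1
agent.channels = c1
agent.sinks = k1
agent.sources.s1.type = com.example.flume.MultiSocketSource  # hypothetical custom source class
agent.sources.s1.channels = c1
agent.channels.c1.type = file
agent.sinks.k1.type = avro
agent.sinks.k1.channel = c1
agent.sinks.k1.hostname = collector1.example.com  # placeholder
agent.sinks.k1.port = 41414

# Tier 2 (collector node): Avro source -> HDFS.
collector.sources = r1
collector.channels = c1
collector.sinks = h1
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 41414
collector.sources.r1.channels = c1
collector.channels.c1.type = file
collector.sinks.h1.type = hdfs
collector.sinks.h1.channel = c1
collector.sinks.h1.hdfs.path = hdfs://namenode/flume/events

Though I'm not sure step 3 applies: Flume NG doesn't seem to have a master node; the agent/collector/storage tiers from the old 0.9 docs just become chained agents like this.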
Thanks,
Chen


On Thu, Jan 9, 2014 at 2:38 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
Guys,
In my environment= , the client is 5 socket servers. Thus i wrote a custom source spawning 5 t= hreads reading from each of them infinitely,and the sink is hdfs(hive table= ). The work fine by running=A0flume-ng agent.
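For reference, the working single-agent config looks roughly like this (the class name and paths are anonymized placeholders):

# Baseline: one agent, custom source -> memory channel -> HDFS.
a1.sources = s1
a1.channels = c1
a1.sinks = k1
a1.sources.s1.type = com.example.flume.MultiSocketSource  # hypothetical custom source class
a1.sources.s1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/user/hive/warehouse/events  # placeholder Hive table location

started with something like: flume-ng agent --name a1 --conf conf --conf-file conf/a1.properties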

But how can I deploy this in distributed mode (as a cluster)? I am confused about the 3 tiers (agent, collector, storage) mentioned in the doc. Do they apply to my case? How can I separate my agent/collector/storage? Apparently I can only have one agent running: multiple agents would result in duplicates from the socket servers. But I want another agent to take over if one agent dies, and I would also like horizontal scalability for writing to HDFS. How can I achieve all this?

Thank you very much for your advice.
Chen





<= /div>--
Mailing List Arc= hives,





--047d7bdc0628c296ea04ef9692a0--