Subject: Re: seeking help on flume cluster deployment
From: Joao Salcedo <joao.salcedo@gmail.com>
To: user@flume.apache.org
Date: Fri, 10 Jan 2014 15:58:13 +1100

Hi Chen,

Maybe it would be worth checking this:
http://flume.apache.org/FlumeDeveloperGuide.html#loadbalancing-rpc-client
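For a concrete starting point, here is a minimal, untested sketch of a client using it; the host names and the port are placeholders, not your actual agents:

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class LoadBalancedSender {
    public static void main(String[] args) throws EventDeliveryException {
        Properties props = new Properties();
        // Spread appends across two agents that each run an Avro source.
        props.setProperty("client.type", "default_loadbalance");
        props.setProperty("hosts", "h1 h2");
        props.setProperty("hosts.h1", "agent1.example.com:41414"); // placeholder host
        props.setProperty("hosts.h2", "agent2.example.com:41414"); // placeholder host
        props.setProperty("host-selector", "round_robin");
        props.setProperty("backoff", "true"); // temporarily skip a host that fails

        RpcClient client = RpcClientFactory.getInstance(props);
        try {
            Event event = EventBuilder.withBody("hello flume", StandardCharsets.UTF_8);
            client.append(event); // retried on the other host if one is down
        } finally {
            client.close();
        }
    }
}

If one agent goes down, the client fails over to the remaining one, which covers the "one node dies, another picks up" requirement on the delivery path.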

Regards,

Joao


On Fri, Jan 10, 20= 14 at 3:50 PM, Jeff Lord <jlord@cloudera.com> wrote:
Have you taken a look at th= e load balancing rpc client?

On Thu, Jan 9, 2014 at 8:43 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:

Jeff,
I read this presentation at the beginning but didn't find a solution to my use case. To simplify: I have only one data source (composed of 5 socket servers), and I am looking for a fault-tolerant Flume deployment that reads from this single source and sinks to HDFS, so that when one node dies another Flume node can pick up and continue.
Thanks,
Chen


On Thu, Jan 9, 2014 at 7:49 PM, Jeff Lo= rd <jlord@cloudera.com> wrote:
Chen,

Ha= ve you taken a look at this presentation on Planning and Deploying Flume fr= om ApacheCon?


It may have the answers you need.

<= /div>
Best,

Jeff


On Thu, Jan 9, 2014 at 7:24 PM, Chen Wang <chen.apache.solr@gmail= .com> wrote:

Thanks Saurabh.
If that is the case, I am actually thinking about using a Storm spout to talk to our socket servers, so that the Storm cluster takes care of the socket-reading part. Then, on each Storm node, start a Flume agent that listens on an RPC port and writes to HDFS (with failover). The Storm bolt would simply send the data over RPC so that Flume can get it.
What do you think of this setup? It takes care of failover on both the source (by Storm) and the sink (by Flume), but it looks a little complicated to me.
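For what it's worth, I picture the per-node agent config looking roughly like this; the port and paths are placeholders and I haven't tested it:

# Agent on each Storm node: Avro RPC in, HDFS out.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Avro source that the Storm bolt sends events to over RPC.
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.channels = c1

# Durable file channel so an agent crash does not lose buffered events.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

The bolt side would then just build a client with RpcClientFactory.getDefaultInstance("localhost", 41414) and call client.append(...), as in the Developer Guide linked above.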
Chen


On Thu, Jan 9, 2014 at 7:18 PM, Saurabh B <qna.list.141211@gmail.c= om> wrote:
Hi Chen,

I think Flume doesn't have= a way to configure multiple sources pointing to same data source. Of cours= e you can do that, but you will end up with duplicate data. Flume offers fa= il over at the sink level.
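To make that concrete, sink-level failover is configured with a sink group. A rough, untested sketch (the sink names are placeholders; k1 and k2 would be defined elsewhere in the same file):

# Two sinks in a failover group: k2 takes over while k1 is down.
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

The higher-priority sink is used as long as it is healthy; maxpenalty caps, in milliseconds, how long a failed sink is penalized before it is retried.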

On Thu, Jan 9, 2014 at 6:56 PM, Chen Wang <chen.apache.s= olr@gmail.com> wrote:
Ok. so after more researchi= ng:) It seems that what i need is the failover for agent source, (not fail = over for sink):
If one agent dies, another same kind of agent will start running.
Does flume support this scenario?
Thanks,
Chen=A0


On Thu, Jan 9, 2014 at 3:12 PM, Chen= Wang <chen.apache.solr@gmail.com> wrote:
After reading more docs, it= seems that if I want to achieve my goal, i have to do the following:
1= . Having one agent with the custom source running on one node. This agent r= eads from those 5 socket server, and sink to some kind of sink(maybe anothe= r socket?)
2. On another(or more) machines, setting up collectors that read from = the agent sink in 1, and sink to hdfs.
3. Having a master node ma= naging nodes in 1,2.

But it seems to be overskille= d in my case: in 1, i can already sink to hdfs. Since the data available at= socket server are much faster than the data translation part. =A0I want to= be able to later add more nodes to do the translation job. so what is the = correct setup?
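Roughly, assuming placeholder host names and a hypothetical class name for my custom source (untested), I picture the two tiers wired with an Avro hop like this:

# Tier 1 (agent node): custom source -> Avro sink to a collector.
agent.sources = s1
agent.channels = c1
agent.sinks = k1
agent.sources.s1.type = com.example.flume.MultiSocketSource  # hypothetical custom source class
agent.sources.s1.channels = c1
agent.channels.c1.type = file
agent.sinks.k1.type = avro
agent.sinks.k1.channel = c1
agent.sinks.k1.hostname = collector1.example.com  # placeholder
agent.sinks.k1.port = 41414

# Tier 2 (collector node): Avro source -> HDFS.
collector.sources = r1
collector.channels = c1
collector.sinks = h1
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 41414
collector.sources.r1.channels = c1
collector.channels.c1.type = file
collector.sinks.h1.type = hdfs
collector.sinks.h1.channel = c1
collector.sinks.h1.hdfs.path = hdfs://namenode/flume/events

Though I'm not sure step 3 applies: Flume NG doesn't seem to have a master node; the agent/collector/storage tiers from the old 0.9 docs just become chained agents like this.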
Thanks,
Chen


On Thu, Jan 9, 2014 at 2:38 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
Guys,
In my environment= , the client is 5 socket servers. Thus i wrote a custom source spawning 5 t= hreads reading from each of them infinitely,and the sink is hdfs(hive table= ). The work fine by running=A0flume-ng agent.
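For reference, the working single-agent config looks roughly like this (the class name and paths are anonymized placeholders):

# Baseline: one agent, custom source -> memory channel -> HDFS.
a1.sources = s1
a1.channels = c1
a1.sinks = k1
a1.sources.s1.type = com.example.flume.MultiSocketSource  # hypothetical custom source class
a1.sources.s1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/user/hive/warehouse/events  # placeholder Hive table location

started with something like: flume-ng agent --name a1 --conf conf --conf-file conf/a1.properties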

But how can I deploy this in distributed mode (as a cluster)? I am confused about the 3 tiers (agent, collector, storage) mentioned in the doc. Do they apply to my case? How can I separate my agent/collector/storage? Apparently I can only have one agent running: multiple agents would result in duplicates from the socket servers. But I want another agent to take over if one agent dies, and I would also like horizontal scalability for writing to HDFS. How can I achieve all this?

Thank you very much for your advice.
Chen





<= /div>--
Mailing List Arc= hives,





--047d7bdc0628c296ea04ef9692a0--