Subject: Re: seeking help on flume cluster deployment
From: Chen Wang <chen.apache.solr@gmail.com>
To: user@flume.apache.org
Date: Thu, 9 Jan 2014 19:24:59 -0800

Thanks Saurabh.

If that is the case, I am actually thinking about using a Storm spout to talk to our socket servers, so that the Storm cluster takes care of reading from them. Then, on each Storm node, start a Flume agent that listens on an RPC port and writes to HDFS (with failover), and have the Storm bolt simply send the data to that RPC port so Flume can pick it up.

What do you think of this setup? It takes care of failover both on the source side (by Storm) and on the sink side (by Flume), but it looks a little complicated to me.

Chen
On Thu, Jan 9, 2014 at 7:18 PM, Saurabh B <qna.list.141211@gmail.com> wrote:
Hi Chen,

I think Flume doesn't have a way to configure multiple sources pointing at the same data source. You can of course do that, but you will end up with duplicate data. Flume offers failover at the sink level.
On Thu, Jan 9, 2014 at 6:56 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
Ok, so after more researching :) it seems that what I need is failover for the agent source, not failover for the sink: if one agent dies, another agent of the same kind should start running. Does Flume support this scenario?

Thanks,
Chen
On Thu, Jan 9, 2014 at 3:12 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
After reading more docs, it seems that if I want to achieve my goal, I have to do the following:

1. Have one agent with the custom source running on one node. This agent reads from those 5 socket servers and sinks to some kind of sink (maybe another socket?).
2. On one or more other machines, set up collectors that read from the agent's sink in 1 and sink to HDFS.
3. Have a master node managing the nodes in 1 and 2.

But this seems like overkill in my case: in 1, I can already sink to HDFS. Since data arrive at the socket servers much faster than the translation part can process them, I want to be able to add more nodes later to do the translation job. So what is the correct setup?

Thanks,
Chen
On Thu, Jan 9, 2014 at 2:38 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
Guys,

In my environment, the client is 5 socket servers, so I wrote a custom source that spawns 5 threads, each reading from one of them indefinitely; the sink is HDFS (a Hive table). This works fine when run as a single flume-ng agent.

But how can I deploy this in distributed mode (as a cluster)? I am confused about the 3 tiers (agent, collector, storage) mentioned in the docs. Do they apply to my case? How can I separate my agent/collector/storage? Apparently I can have only one agent running: multiple agents would result in duplicates from the socket servers. But I want another agent to take over if one agent dies, and I would also like horizontal scalability for writing to HDFS. How can I achieve all this?

Thank you very much for your advice.
Chen