flume-user mailing list archives

From Ashish <paliwalash...@gmail.com>
Subject Re: seeking help on flume cluster deployment
Date Fri, 10 Jan 2014 06:00:39 GMT
Got it!

My first reaction was to use an HDFS bolt to write the data directly to HDFS,
but I couldn't find an existing implementation; my knowledge of Storm is
limited. If the data is already flowing through Storm, you have two options:
1. Write a bolt that dumps the data to HDFS directly.
2. Write a Flume bolt using the RPC client, as recommended earlier in the
thread, and reuse Flume's capabilities (see the sketch below).

If you already have a Flume installation running, #2 is the quickest way to
get going. Even if you don't, installing and running Flume is a walk in the
park :)
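
For illustration, here is a rough, untested sketch of what option #2 could look
like. It assumes Storm 0.9.x (the old backtype.storm packages), two Flume
agents running Avro sources on port 41414, and a tuple field named "payload";
adjust all of those to your topology. The RPC client classes come from the
flume-ng-sdk artifact.

    import java.nio.charset.Charset;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    // Hypothetical bolt that forwards each tuple's "payload" field to a pair
    // of Flume Avro sources through the load-balancing RPC client.
    public class FlumeForwardBolt extends BaseRichBolt {

        private transient RpcClient client;
        private transient OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context,
                            OutputCollector collector) {
            this.collector = collector;
            // Example hosts: two Flume agents with Avro sources on 41414.
            Properties props = new Properties();
            props.put("client.type", "default_loadbalance");
            props.put("hosts", "h1 h2");
            props.put("hosts.h1", "flume-node1.example.com:41414");
            props.put("hosts.h2", "flume-node2.example.com:41414");
            props.put("backoff", "true");   // temporarily skip failed hosts
            this.client = RpcClientFactory.getInstance(props);
        }

        @Override
        public void execute(Tuple tuple) {
            Event event = EventBuilder.withBody(
                    tuple.getStringByField("payload"), Charset.forName("UTF-8"));
            try {
                client.append(event);   // delivered to one of the two agents
                collector.ack(tuple);
            } catch (EventDeliveryException e) {
                collector.fail(tuple);  // let Storm replay the tuple
            }
        }

        @Override
        public void cleanup() {
            if (client != null) {
                client.close();
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: nothing emitted downstream
        }
    }

If your spout replays failed tuples, failing the tuple on
EventDeliveryException gives you at-least-once delivery into Flume.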

You can also follow the related discussion on
https://issues.apache.org/jira/browse/FLUME-1286; there is some good info in
that JIRA.

thanks
ashish




On Fri, Jan 10, 2014 at 11:08 AM, Chen Wang <chen.apache.solr@gmail.com> wrote:

> Ashish,
> Since we already use Storm for other real-time processing, I want to reuse
> it. The biggest advantage of using Storm in this case is that I can use a
> Storm spout to read from our socket server continuously, and the Storm
> framework ensures it never stops. Meanwhile, I can also easily filter or
> translate the data in a bolt before sending it to Flume. For this data
> stream, my first step right now is to get it into HDFS, but I will add
> real-time processing soon.
> Does that make sense to you?
> Thanks,
> Chen
>
>
> On Thu, Jan 9, 2014 at 9:29 PM, Ashish <paliwalashish@gmail.com> wrote:
>
>> Why do you need Storm? Are you doing any real-time processing? If not,
>> IMHO, avoid Storm.
>>
>> You could use something like this:
>>
>> Socket -> Load Balanced RPC Client -> Flume Topology with HA
>>
>> What application-level protocol are you using at the socket level?
>>
>>
>> On Fri, Jan 10, 2014 at 10:50 AM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>
>>> Jeff, Joao,
>>> Thanks for the pointer!
>>> I think I am getting close here:
>>> 1. Set up a cluster of Flume agents with redundancy, using an Avro source
>>> and an HDFS sink (config sketch below).
>>> 2. Use Storm (not strictly necessary) to read from our socket server, then
>>> in the bolt use the Flume load-balancing RPC client to send the events to
>>> the agents set up in step 1.
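>>>
>>> For step 1, I am picturing an agent config roughly like this (untested
>>> sketch; host, port and path values are placeholders):
>>>
>>>   a1.sources = r1
>>>   a1.channels = c1
>>>   a1.sinks = k1
>>>
>>>   a1.sources.r1.type = avro
>>>   a1.sources.r1.bind = 0.0.0.0
>>>   a1.sources.r1.port = 41414
>>>   a1.sources.r1.channels = c1
>>>
>>>   # durable channel so events survive an agent restart
>>>   a1.channels.c1.type = file
>>>
>>>   a1.sinks.k1.type = hdfs
>>>   a1.sinks.k1.channel = c1
>>>   a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
>>>   a1.sinks.k1.hdfs.fileType = DataStream
>>>   a1.sinks.k1.hdfs.useLocalTimeStamp = true
>>>
>>> The same file would run on each of the redundant agent nodes.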
>>>
>>> That way I get the benefits of both Storm and Flume. Does this setup look
>>> right to you?
>>> thank you very much,
>>> Chen
>>>
>>>
>>> On Thu, Jan 9, 2014 at 8:58 PM, Joao Salcedo <joao.salcedo@gmail.com> wrote:
>>>
>>>> Hi Chen,
>>>>
>>>> Maybe it would be worth checking this
>>>>
>>>> http://flume.apache.org/FlumeDeveloperGuide.html#loadbalancing-rpc-client
>>>>
>>>> Regards,
>>>>
>>>> Joao
>>>>
>>>>
>>>> On Fri, Jan 10, 2014 at 3:50 PM, Jeff Lord <jlord@cloudera.com> wrote:
>>>>
>>>>> Have you taken a look at the load balancing rpc client?
>>>>>
>>>>>
>>>>>> On Thu, Jan 9, 2014 at 8:43 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>>>
>>>>>> Jeff,
>>>>>> I have read this ppt at the beginning, but didn't find a solution to my
>>>>>> use case. To simplify: I only have 1 data source (composed of 5 socket
>>>>>> servers), and I am looking for a fault-tolerant deployment of Flume that
>>>>>> can read from this single data source and sink to HDFS in fault-tolerant
>>>>>> mode: when one node dies, another Flume node can pick up and continue.
>>>>>> Thanks,
>>>>>> Chen
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 9, 2014 at 7:49 PM, Jeff Lord <jlord@cloudera.com> wrote:
>>>>>>
>>>>>>> Chen,
>>>>>>>
>>>>>>> Have you taken a look at this presentation on Planning and Deploying
>>>>>>> Flume from ApacheCon?
>>>>>>>
>>>>>>>
>>>>>>> http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf
>>>>>>>
>>>>>>> It may have the answers you need.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Jeff
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 9, 2014 at 7:24 PM, Chen Wang <
>>>>>>> chen.apache.solr@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Saurabh.
>>>>>>>> If that is the case, I am actually thinking about using a Storm spout
>>>>>>>> to talk to our socket server, so that the Storm cluster takes care of
>>>>>>>> reading from the socket server. Then on each Storm node, start a Flume
>>>>>>>> agent listening on an RPC port and writing to HDFS (with failover). In
>>>>>>>> the Storm bolt, simply send the data over RPC so that Flume can pick
>>>>>>>> it up.
>>>>>>>> What do you think of this setup? It takes care of failover on both the
>>>>>>>> source (by Storm) and the sink (by Flume), but it looks a little
>>>>>>>> complicated to me.
>>>>>>>> Chen
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jan 9, 2014 at 7:18 PM, Saurabh B <
>>>>>>>> qna.list.141211@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Chen,
>>>>>>>>>
>>>>>>>>> I don't think Flume has a way to coordinate multiple sources reading
>>>>>>>>> from the same data source. Of course you can configure that, but you
>>>>>>>>> will end up with duplicate data. Flume offers failover at the sink
>>>>>>>>> level.
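>>>>>>>>>
>>>>>>>>> For example, sink-level failover is configured with a sink group,
>>>>>>>>> roughly like this (sketch; sink names and priorities are
>>>>>>>>> placeholders):
>>>>>>>>>
>>>>>>>>>   a1.sinkgroups = g1
>>>>>>>>>   a1.sinkgroups.g1.sinks = k1 k2
>>>>>>>>>   a1.sinkgroups.g1.processor.type = failover
>>>>>>>>>   # higher-priority sink is tried first; k2 takes over if k1 fails
>>>>>>>>>   a1.sinkgroups.g1.processor.priority.k1 = 10
>>>>>>>>>   a1.sinkgroups.g1.processor.priority.k2 = 5
>>>>>>>>>   a1.sinkgroups.g1.processor.maxpenalty = 10000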
>>>>>>>>>
>>>>>>>>> On Thu, Jan 9, 2014 at 6:56 PM, Chen Wang <
>>>>>>>>> chen.apache.solr@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Ok, so after more research :) it seems that what I need is failover
>>>>>>>>>> for the agent source, not failover for the sink: if one agent dies,
>>>>>>>>>> another agent of the same kind starts running. Does Flume support
>>>>>>>>>> this scenario?
>>>>>>>>>> Thanks,
>>>>>>>>>> Chen
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 9, 2014 at 3:12 PM, Chen Wang <
>>>>>>>>>> chen.apache.solr@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> After reading more docs, it seems that to achieve my goal I have to
>>>>>>>>>>> do the following:
>>>>>>>>>>> 1. Run one agent with the custom source on one node. This agent
>>>>>>>>>>> reads from those 5 socket servers and sinks to some kind of sink
>>>>>>>>>>> (maybe another socket?).
>>>>>>>>>>> 2. On one or more other machines, set up collectors that read from
>>>>>>>>>>> the agent's sink in 1 and sink to HDFS.
>>>>>>>>>>> 3. Have a master node managing the nodes in 1 and 2.
>>>>>>>>>>>
>>>>>>>>>>> But this seems like overkill in my case: in 1, I can already sink to
>>>>>>>>>>> HDFS, since the data arrives at the socket servers much faster than
>>>>>>>>>>> the data translation part can process it. I want to be able to add
>>>>>>>>>>> more nodes later to do the translation job, so what is the correct
>>>>>>>>>>> setup?
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Chen
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 9, 2014 at 2:38 PM, Chen Wang <
>>>>>>>>>>> chen.apache.solr@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Guys,
>>>>>>>>>>>> In my environment, the client is 5 socket servers, so I wrote a
>>>>>>>>>>>> custom source that spawns 5 threads reading from each of them
>>>>>>>>>>>> continuously, and the sink is HDFS (a Hive table). This works fine
>>>>>>>>>>>> when running a single flume-ng agent.
>>>>>>>>>>>>
>>>>>>>>>>>> But how can I deploy this in distributed (cluster) mode? I am
>>>>>>>>>>>> confused about the 3 tiers (agent, collector, storage) mentioned in
>>>>>>>>>>>> the doc. Do they apply to my case? How can I separate my
>>>>>>>>>>>> agent/collector/storage? Apparently I can only have one agent
>>>>>>>>>>>> running, since multiple agents would read duplicates from the
>>>>>>>>>>>> socket servers. But I want another agent to take over if one agent
>>>>>>>>>>>> dies, and I would also like to be able to scale writing to HDFS
>>>>>>>>>>>> horizontally. How can I achieve all this?
>>>>>>>>>>>>
>>>>>>>>>>>> thank you very much for your advice.
>>>>>>>>>>>> Chen
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Mailing List Archives,
>>>>>>>>> QnaList.com
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> thanks
>> ashish
>>
>> Blog: http://www.ashishpaliwal.com/blog
>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>>
>
>


-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
