flume-user mailing list archives

From Ashish <paliwalash...@gmail.com>
Subject Re: Collecting thousands of sources
Date Fri, 05 Sep 2014 11:28:13 GMT
On Fri, Sep 5, 2014 at 4:01 PM, JuanFra Rodriguez Cardoso <
juanfra.rodriguez.cardoso@gmail.com> wrote:

> Thanks, both of you!
>
> @Ashish, Javi's thoughts are right. My use case is focused on sources for
> consuming SNMP traps. I came here from the already open discussion [1] in
> the hope that someone else was facing this problem.
>
> Your solution based on an async SNMP walker would help us scale to
> thousands of agents, but it would reduce every scenario to the same
> process:
>
> 1. Code a custom collector (async or not) that sends data to the Flume
> spool dir.
> 2. The agents' sources would consume data from that dir.
>

You might not need to code a custom collector; NMS systems do that already.
So if you have an NMS system in place, maybe it can do this polling for
you and dump the records somewhere Flume can pick them up.
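As a rough sketch, the Flume side of that hand-off could be a plain Spooling Directory Source pointed at wherever the NMS writes its completed dump files (the agent name `a1` and the paths here are hypothetical, not anything from the thread):

```properties
# Hypothetical: the NMS drops completed dump files into /var/nms/export
a1.sources = snmpdump
a1.channels = c1
a1.sources.snmpdump.type = spooldir
a1.sources.snmpdump.spoolDir = /var/nms/export
a1.sources.snmpdump.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
```

Note the spooldir source requires files to be immutable once they appear in the directory, so the NMS (or collector) must write elsewhere and move completed files in.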

If this is not an option, you need to write a custom collector, either
standalone or within a Flume Source.
I went through the SNMP Source and have a suggestion: if PDU decoding can
be avoided, it would save a lot of CPU at the collection tier. Since no
action is taken in the Source, the raw PDU can be offloaded to the channel
as-is. I wrote an SNMP ping long back; the problem statement was similar,
poll SNMP agents for specific OIDs. I didn't use the SNMP library directly,
I just used it to encode and decode packets and managed the network layer
myself.
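To make the walker idea concrete, here is a minimal sketch of polling many agents in parallel and dropping the raw (undecoded) responses into a spool directory for Flume. The `fetch` callable is a placeholder for whatever performs the actual SNMP GET (e.g. a thin wrapper around an SNMP library used only to encode/decode the packet); nothing here is Flume API, it only respects the spooldir contract that files must be complete before they appear:

```python
import concurrent.futures
import os
import tempfile

def poll_agents(hosts, fetch, spool_dir, batch_name="snmp-batch"):
    """Poll many SNMP agents in parallel and write the raw, undecoded
    responses into a Flume spooling directory.

    `fetch` is a stand-in for the real SNMP GET: it takes one host and
    returns the raw response bytes for it.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        results = list(pool.map(fetch, hosts))

    # Flume's spooldir source must only ever see completed files, so
    # write to a temp file first and atomically rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        for host, raw in zip(hosts, results):
            f.write(host.encode() + b" " + raw + b"\n")
    final_path = os.path.join(spool_dir, batch_name + ".log")
    os.rename(tmp_path, final_path)
    return final_path
```

A real walker would also need per-host timeouts and retry bookkeeping (which agents have outstanding requests), but the parallel-poll-then-rename shape stays the same.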


> Don't you think it would be more suitable to include an option in
> flume.conf (path/to/list-of-thousands-sources) as Javi commented above?
> This way, agent's configuration would be easier to manage.
>

I think I agreed to this option :)


>
> [1] https://issues.apache.org/jira/browse/FLUME-2039
>
> Regards,
> ---
> JuanFra Rodriguez Cardoso
>
>
> 2014-09-05 10:20 GMT+02:00 Ashish <paliwalashish@gmail.com>:
>
>> Jovi,
>>
>> I have NMS background, so understand your concern.
>>
>> Answers inline
>>
>>
>> On Fri, Sep 5, 2014 at 12:44 PM, Javi Roman <javiroman@redoop.org> wrote:
>>
>>> Hello,
>>>
>>> The scenario Juanfra is describing is related to the question I asked
>>> a few days ago [1].
>>>
>>
>> I am not sure about this, Juanfra is a better person to comment.
>>
>>
>>>
>>> You cannot install Flume agents on the SNMP managed devices, and you
>>> cannot modify any software on the SNMP managed device to use the Flume
>>> client SDK (if I understand your idea correctly, Ashish). There are two
>>> ways to collect SNMP data from SNMP devices using Flume (IMHO):
>>>
>>
>> Agreed.
>>
>>
>>>
>>> 1. Create a custom application which launches the SNMP queries to the
>>> thousands of devices and logs the answers into a file: in this case
>>> Flume can tail this file with the "exec source" core plugin.
>>>
>>
>> IMHO, this would be the preferred way for me: create a simple SNMP walker
>> which polls nodes in parallel and writes responses to a file.
>> Use Flume's Spool Dir Source and the rest of the flow remains the same.
>> I would avoid decoding Events unless they need to be interpreted in the
>> Flume chain.
>>
>>
>>>
>>> 2. Use a flume-snmp-source plugin (similar to [2]); in other words,
>>> shift the SNMP query custom application into a specialized Flume
>>> plugin.
>>>
>>
>> Possible; it's like running the #1 solution inside Flume.
>> For both, you would need to keep track of which agents have been sent
>> requests.
>> An async SNMP walker would help you scale to thousands of agents.
>>
>>
>>>
>>> Juanfra is talking about a scenario like the second point. In that
>>> case you have to handle a huge Flume configuration file, with an entry
>>> for each managed device to query. For this situation I guess there are
>>> two possible solutions:
>>>
>>> 1. The flume-snmp-source plugin can take a file with a list of hosts to
>>> query as a parameter:
>>>
>>> agent.sources.source1.host = /path/to/list-of-host-file
>>>
>>> However, I guess this breaks the philosophy or simplicity of the other
>>> core plugins of Flume.
>>>
>>> 2. Create a little program to fill in the Flume configuration file from
>>> a template, or something similar.
>>>
>>
>> I would go with #1; it keeps the Flume config file simple. We still need
>> to distribute the host file, but on a much smaller scale.
>>
>>
>>>
>>>
>>> Any other ideas? I guess this is a good discussion about a real world
>>> use case.
>>>
>>>
>>> [1]
>>> http://mail-archives.apache.org/mod_mbox/flume-user/201409.mbox/browser
>>> [2] https://github.com/javiroman/flume-snmp-source
>>>
>>> On Fri, Sep 5, 2014 at 4:56 AM, Ashish <paliwalashish@gmail.com> wrote:
>>> >
>>> > Have a look at the Flume Client SDK. One simple way would be to use
>>> > Flume client implementations to send Events to Flume Sources; this
>>> > would significantly reduce the number of Sources you need to manage.
>>> >
>>> > HTH !
>>> >
>>> >
>>> > On Thu, Sep 4, 2014 at 9:40 PM, JuanFra Rodriguez Cardoso <
>>> juanfra.rodriguez.cardoso@gmail.com> wrote:
>>> >>
>>> >> Thanks Andrew for your quick response.
>>> >>
>>> >> My sources (server PDU) can't put events into an aggregation point.
>>> >> For this reason I'm following a PollingSource schema where my agent
>>> >> needs to be configured with thousands of sources. Any clues for use
>>> >> cases where data is ingested via a polling process?
>>> >>
>>> >> Regards!
>>> >> ---
>>> >> JuanFra Rodriguez Cardoso
>>> >>
>>> >>
>>> >> 2014-09-04 17:41 GMT+02:00 Andrew Ehrlich <andrew@aehrlich.com>:
>>> >>>
>>> >>> One way to avoid managing so many sources would be to have an
>>> >>> aggregation point between the data generators and the Flume sources.
>>> >>> For example, maybe you could have the data generators put events into
>>> >>> a message queue (or queues), then have Flume consume from there?
>>> >>>
>>> >>> Andrew
>>> >>>
>>> >>> ---- On Thu, 04 Sep 2014 08:29:04 -0700 JuanFra Rodriguez Cardoso<
>>> juanfra.rodriguez.cardoso@gmail.com> wrote ----
>>> >>>
>>> >>> Hi all:
>>> >>>
>>> >>> Considering an environment with thousands of sources, what are the
>>> >>> best practices for managing the agent configuration (flume.conf)? Is
>>> >>> it recommended to create a multi-layer topology where each agent
>>> >>> takes control of a subset of sources?
>>> >>>
>>> >>> In that case, a conf mgmt server (such as Puppet) would be
>>> >>> responsible for editing flume.conf with parameters 'agent.sources'
>>> >>> from source1 to source3000 (assuming we have 3000 source machines).
>>> >>>
>>> >>> Are my thoughts aligned with such scenarios of large-scale data
>>> >>> ingestion?
>>> >>>
>>> >>> Thanks a lot!
>>> >>> ---
>>> >>> JuanFra
>>> >>>
>>> >>>
>>> >>
>>> >
>>> >
>>> >
>>>
>>
>>
>>
>>
>
>


-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
