flume-user mailing list archives

From Bessenyei Balázs Donát <bes...@apache.org>
Subject Re: Use case for Flume
Date Sat, 09 Sep 2017 09:27:57 GMT
Hi Gintas,

I can't think of a completely out-of-the-box Flume solution, but I
believe Flume does suit your needs.
The multi-agent solution is doable: you'd have to either implement a
source (probably based on the Avro Source) or an interceptor to do
the downloading, as previously discussed.
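
As a minimal sketch of that interceptor idea (all class, package and
header names here are made up, and it assumes the event body carries
the URL as a UTF-8 string):

package com.example.flume;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class UrlFetchInterceptor implements Interceptor {

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        // The event body is assumed to hold the link taken from Kafka.
        String link = new String(event.getBody(), StandardCharsets.UTF_8).trim();
        try (InputStream in = new URL(link).openStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            // Replace the link with the downloaded content and keep the
            // original URL in a header, e.g. for building file names.
            event.getHeaders().put("source.url", link);
            event.setBody(out.toByteArray());
            return event;
        } catch (Exception e) {
            // Returning null drops the event; real code would need a
            // retry or dead-letter strategy instead.
            return null;
        }
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> intercepted = new ArrayList<>(events.size());
        for (Event event : events) {
            Event result = intercept(event);
            if (result != null) {
                intercepted.add(result);
            }
        }
        return intercepted;
    }

    @Override
    public void close() { }

    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new UrlFetchInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}

Keep in mind that the whole file ends up in the event body in memory,
so this only makes sense for small files like the ones discussed below.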

If you need any further help, please let us know.


Thank you,

Donat

2017-09-05 21:13 GMT+02:00 Gintautas Sulskus <gintautas.sulskus@gmail.com>:
> Hi,
>
> Thanks for the quick replies, guys.
>
> Donat, sorry, I do not have example configs. At the moment I am just
> considering available solutions to tackle the problem at hand. I would very
> much prefer Flume for its modular and scalable approach. I would like to
> find an elegant solution that would be "native" to Flume.
> I was considering the two-agent approach as well. But then, what would the
> middle part look like? Which component would download the file? I
> assume I would face the same problem as now.
>
> Denes, files would be up to 5 megabytes in size. The interceptor approach
> looks the most suitable in this situation.
> Regarding the sink-side interceptor, wouldn't it have the same 64MB size
> limit as the source-side one?
>
> Best,
> Gintas
>
>
> On 5 Sep 2017 16:54, "Denes Arvay" <denes@cloudera.com> wrote:
>
> Hi Gintas,
>
> What is the average (or expected maximum) size of the files you'd like to
> process?
> In general it is not recommended to transfer large events (i.e. larger than
> 64MB if you use the file channel, as this is a hard limit of the protobuf
> implementation). If your files fit within this limit, then I'd suggest using
> an interceptor to fetch the data, update the event's body and push it
> through Flume.
>
> In this case your setup would be:
> Kafka source + data fetcher interceptor (custom code) -> file channel (or
> memory) -> HDFS sink
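>
> A minimal config sketch for that chain (the broker, topic, paths and
> the interceptor class are placeholders; the interceptor itself is the
> custom code mentioned above):
>
> a1.sources = r1
> a1.channels = c1
> a1.sinks = k1
>
> a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
> a1.sources.r1.kafka.bootstrap.servers = kafka1:9092
> a1.sources.r1.kafka.topics = Q1
> a1.sources.r1.channels = c1
> a1.sources.r1.interceptors = i1
> a1.sources.r1.interceptors.i1.type = com.example.flume.UrlFetchInterceptor$Builder
>
> a1.channels.c1.type = file
> a1.channels.c1.checkpointDir = /var/flume/checkpoint
> a1.channels.c1.dataDirs = /var/flume/data
>
> a1.sinks.k1.type = hdfs
> a1.sinks.k1.channel = c1
> a1.sinks.k1.hdfs.path = hdfs://namenode/flume/files
> a1.sinks.k1.hdfs.fileType = DataStream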
>
> If the files are larger, then you could use a customised HDFS sink which
> fetches the URL and stores the file on HDFS.
> In this case I'd recommend using a Kafka channel -> custom HDFS sink setup
> without configuring any source, for example:
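>
> (Again a sketch; the sink class name stands in for the customised
> sink, and parseAsFlumeEvent is set to false because the links are
> written to Q1 by an external producer, not by a Flume source.)
>
> a1.channels = c1
> a1.sinks = k1
>
> a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
> a1.channels.c1.kafka.bootstrap.servers = kafka1:9092
> a1.channels.c1.kafka.topic = Q1
> a1.channels.c1.kafka.consumer.group.id = flume-fetcher
> a1.channels.c1.parseAsFlumeEvent = false
>
> a1.sinks.k1.type = com.example.flume.FetchingHdfsSink
> a1.sinks.k1.channel = c1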
>
> Actually, for your problem a sink-side interceptor would be a good
> solution (https://issues.apache.org/jira/browse/FLUME-2580), but
> unfortunately it is not implemented yet.
>
> Regards,
> Denes
>
> On Tue, Sep 5, 2017 at 2:00 PM Gintautas Sulskus
> <gintautas.sulskus@gmail.com> wrote:
>>
>> Hi,
>>
>> I have a question regarding Flume suitability for a particular use case.
>>
>> Task: There is a constant incoming stream of links that point to files.
>> Those files are to be fetched and stored in HDFS.
>>
>> Desired implementation:
>>
>> 1. Each link to a file is stored in Kafka queue Q1.
>> 2. Flume A1.source monitors Q1 for new links.
>> 3. Upon retrieving a link from Q1, A1.source fetches the file. The file
>> is eventually stored in HDFS by A1.sink.
>>
>> My concern here is the seemingly overloaded functionality of A1.source.
>> It would have to perform two activities: 1) periodically poll queue Q1
>> for new links to files and then 2) fetch those files.
>>
>> What do you think? Is there a cleaner way to achieve this, e.g. by using
>> an interceptor to fetch files? Would this be appropriate?
>>
>> Best,
>> Gintas
>
>
