flume-user mailing list archives

From Ashish <paliwalash...@gmail.com>
Subject Re: Newbie - Sink question
Date Sat, 06 Sep 2014 04:59:35 GMT
On Sat, Sep 6, 2014 at 4:42 AM, Kevin Warner <kevinwarner7965@gmail.com>
wrote:

> Thanks Andrew, Ashish and Sharinder for your response.
>
> I have a large number of JSON files, about 2K each, on Tomcat
> servers. We are using rsync to get the files from the Tomcat servers to the
> EC2 compute instances. Let's say we have 4 Tomcat servers; do we need 4
> machines (EC2) with Flume on them?
>

No. If you can install Flume on the Tomcat servers, Flume will transfer the
files for you. Use the Spooling Directory Source and make sure it points to
a location that holds only completed files, since the source expects files
to be immutable once they appear there.
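
As a rough sketch, a single agent on one Tomcat server could be configured
along these lines. The agent name (tomcat1), the component names and all
directory paths are placeholders made up for illustration, and the logger
sink is only there so the source can be tried out before a real sink is
wired in.

```properties
# Hypothetical agent name and paths; adjust to your environment.
tomcat1.sources = jsonSpool
tomcat1.channels = fileCh
tomcat1.sinks = debugLog

# Spooling Directory Source: picks up files placed in spoolDir.
# With the default deserializer, each line of a file becomes one event.
tomcat1.sources.jsonSpool.type = spooldir
tomcat1.sources.jsonSpool.spoolDir = /var/spool/flume/json-completed
tomcat1.sources.jsonSpool.channels = fileCh

# File channel, so buffered events survive an agent restart.
tomcat1.channels.fileCh.type = file
tomcat1.channels.fileCh.checkpointDir = /var/lib/flume/checkpoint
tomcat1.channels.fileCh.dataDirs = /var/lib/flume/data

# Logger sink for a first test only; replace it with the real sink later.
tomcat1.sinks.debugLog.type = logger
tomcat1.sinks.debugLog.channel = fileCh
```

Assuming the file is saved as tomcat1.conf, the agent would be started with
something like:
bin/flume-ng agent --conf conf --conf-file tomcat1.conf --name tomcat1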


>
> On each Flume machine, we have a folder that rsyncs with the Tomcat
> server folder. The Flume source then points to the input folder and,
> after processing (we are planning to use Morphlines), the output is written
> as CSV files and uploaded to S3.
>
> Can anyone send me some examples of a tiered Flume architecture? By
> collection agents, do you mean a set of machines in which each machine
> gets data from multiple Tomcat servers? And after the Collection layer,
> is there a set of machines with a 1-1 relationship between the
> Collection tier and the Transformation tier, where the Transformation
> tier has Flume instances with Morphlines that write the CSV output to S3?
> Also, does it support HA etc.?
>
> Please advise.
>

You can use a 2-tier topology. The first tier collects data from the Tomcat
servers, or from whatever location you specify. It then sends the data to a
second set of Flume agents that do the CSV transformation and write to S3.


Tomcat Server --> Flume Agent(s) (Tier 1) --> Flume Agent(s) (Tier 2) --> S3

For the first tier you would need 4 Flume agents, one per Tomcat server,
assuming Flume runs on the Tomcat servers. These agents pick up files from
the server and send them to the Tier 2 agents, which do the transformation
and write to S3. At Tier 2 you may need 2 or 4 agents, depending on load and
HA requirements. Alternatively, if the Tier 1 agents can handle the
processing themselves, you can leave out the Tier 2 agents.
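
A minimal sketch of how the two tiers could be wired together, under several
assumptions: the agent names (tomcat1, tier2), host names, port, paths,
bucket name and morphline file/id are all invented placeholders; the
transformation is shown as a Morphline interceptor on the Tier 2 source; and
S3 is written through the HDFS sink, which only works if the Hadoop S3
filesystem libraries and credentials are available on the Tier 2 machines.
The failover sink group is one way to cover the HA side on Tier 1.

```properties
# ---- Tier 1: runs on each Tomcat server (hypothetical names throughout) ----
tomcat1.sources = jsonSpool
tomcat1.channels = fileCh
tomcat1.sinks = toTier2a toTier2b

tomcat1.sources.jsonSpool.type = spooldir
tomcat1.sources.jsonSpool.spoolDir = /var/spool/flume/json-completed
tomcat1.sources.jsonSpool.channels = fileCh

tomcat1.channels.fileCh.type = file
tomcat1.channels.fileCh.checkpointDir = /var/lib/flume/checkpoint
tomcat1.channels.fileCh.dataDirs = /var/lib/flume/data

# Two Avro sinks, each pointing at a different Tier 2 agent.
tomcat1.sinks.toTier2a.type = avro
tomcat1.sinks.toTier2a.hostname = tier2-a.example.com
tomcat1.sinks.toTier2a.port = 4545
tomcat1.sinks.toTier2a.channel = fileCh

tomcat1.sinks.toTier2b.type = avro
tomcat1.sinks.toTier2b.hostname = tier2-b.example.com
tomcat1.sinks.toTier2b.port = 4545
tomcat1.sinks.toTier2b.channel = fileCh

# Failover: use toTier2a while it is healthy, fall back to toTier2b.
tomcat1.sinkgroups = tier2Group
tomcat1.sinkgroups.tier2Group.sinks = toTier2a toTier2b
tomcat1.sinkgroups.tier2Group.processor.type = failover
tomcat1.sinkgroups.tier2Group.processor.priority.toTier2a = 10
tomcat1.sinkgroups.tier2Group.processor.priority.toTier2b = 5
tomcat1.sinkgroups.tier2Group.processor.maxpenalty = 10000

# ---- Tier 2: runs on each transformation machine ----
tier2.sources = fromTier1
tier2.channels = fileCh
tier2.sinks = toS3

# Avro source receiving events from the Tier 1 Avro sinks.
tier2.sources.fromTier1.type = avro
tier2.sources.fromTier1.bind = 0.0.0.0
tier2.sources.fromTier1.port = 4545
tier2.sources.fromTier1.channels = fileCh

# Morphline interceptor for the JSON-to-CSV step (placeholder file and id).
tier2.sources.fromTier1.interceptors = toCsv
tier2.sources.fromTier1.interceptors.toCsv.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
tier2.sources.fromTier1.interceptors.toCsv.morphlineFile = /etc/flume/conf/json-to-csv.conf
tier2.sources.fromTier1.interceptors.toCsv.morphlineId = jsonToCsv

tier2.channels.fileCh.type = file
tier2.channels.fileCh.checkpointDir = /var/lib/flume/checkpoint
tier2.channels.fileCh.dataDirs = /var/lib/flume/data

# HDFS sink pointed at an S3 path (placeholder bucket), writing plain text.
tier2.sinks.toS3.type = hdfs
tier2.sinks.toS3.hdfs.path = s3n://example-bucket/flume/csv
tier2.sinks.toS3.hdfs.fileType = DataStream
tier2.sinks.toS3.channel = fileCh
```

Setting processor.type to load_balance instead of failover would spread
traffic across both Tier 2 agents rather than keeping one as a standby.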

Please run some benchmarks before choosing the topology.

This book should come in handy:
http://shop.oreilly.com/product/0636920030348.do (I got it 2 days ago :)


>
> Thanks.
>
> On Thu, Sep 4, 2014 at 11:08 PM, Ashish <paliwalashish@gmail.com> wrote:
>
>> I would recommend using an Interceptor for this and possibly a modified
>> Flume topology. If the JSON files have a large number of rows, or there
>> is a very large number of files, go for a Collection tier and add another
>> level of agents that use interceptors for the DB lookup and CSV
>> generation. Something like
>>
>> Collection Agents -> Transformation Agents (writing to S3 Sinks)
>>
>> You can scale out the Transformation/Collection layer agents based on
>> the traffic volume.
>>
>> thanks
>>
>>
>>
>>
>> On Fri, Sep 5, 2014 at 8:23 AM, Kevin Warner <kevinwarner7965@gmail.com>
>> wrote:
>>
>>> Hello All,
>>> We have the following configuration:
>>> Source->Channel->Sink
>>>
>>> Now, the source is pointing to a folder that has lots of json files. The
>>> channel is file based so that there is fault tolerance and the Sink is
>>> putting CSV files on S3.
>>>
>>> Now, there is code written in Sink that takes the JSON events and does
>>> some MySQL database lookup and generates CSV files to be put into S3.
>>>
>>> The question is: is the Sink the right place for this code, or should
>>> it run in the Channel, since the Channel is where the ACID guarantees
>>> are? Please advise.
>>>
>>> -Kev
>>>
>>>
>>
>>
>>
>> --
>> thanks
>> ashish
>>
>> Blog: http://www.ashishpaliwal.com/blog
>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>>
>
>


-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
