flume-user mailing list archives

From Asim Zafir <asim.za...@gmail.com>
Subject Re: distributed weblogs ingestion on HDFS via flume
Date Thu, 06 Feb 2014 01:29:04 GMT
Ed,

Thanks for the response! I was wondering: if we use the avro sink to get data
into HDFS, should I assume the resident file format in HDFS will be Avro? The
reason I am asking is that Hive/Impala and MapReduce are supposed to have
dependencies on file format and compression, as stated here:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_file_formats.html

I will be really interested to see your response as to how you have handled,
or would suggest handling, these issues.

thanks

Asim


On Wed, Feb 5, 2014 at 4:09 PM, ed <edorsey@gmail.com> wrote:

> Hi Asim,
>
>
> Here's some information that might be helpful based on my relatively new
> experience with Flume:
>
>
> *1) Do all the webservers in our case need to run a Flume agent?*
>
>
> They could, but they don't necessarily have to.  For example, if you don't
> want to put a Flume agent on all your web servers, you could forward the
> logs using syslog to another server running a Flume agent that listens for
> them with the syslog source.  If you do want to put a Flume agent on your
> web servers, you could send the logs to that agent's local syslog source and
> have it use the avro sink to pass the logs to the Flume collection server,
> which would do the actual writing to HDFS; or you could use a file spooler
> source to read the logs from disk and then forward them to the collector
> (again using an avro source and sink).
>
>
> *Not Using Flume on the Webservers:*
>
>
> [webserver1: apache -> syslogd] ==>
>
> [webserver2: apache -> syslogd] ==> [flume collection server: flume syslog
> source --> flume hdfs sink]
>
> [webserver3: apache -> syslogd] ==>
>
>
> *Using Flume on the Webservers Option1:*
>
>
> [webserver1: apache -> syslogd -> flume syslog source -> flume avro sink]
> ==>
>
> [webserver2: apache -> syslogd -> flume syslog source -> flume avro sink]
> ==>  [flume collection server: flume avro source --> flume hdfs sink]
>
> [webserver3: apache -> syslogd -> flume syslog source -> flume avro sink]
> ==>
>
>
> *Using Flume on Webservers Option2:*
>
>
> [webserver1: apache -> filesystem -> flume file spooler source -> flume
> avro sink] ==>
>
> [webserver2: apache -> filesystem -> flume file spooler source -> flume
> avro sink] ==> [flume collection server: flume avro source --> flume hdfs
> sink]
>
> [webserver3: apache -> filesystem -> flume file spooler source -> flume
> avro sink] ==>
>
>
> (By the way, there are probably other ways to do this, and you could even
> split the collection tier out from the storage tier, which is currently
> handled by the same final agent.)
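>
> Just to make the first layout concrete, here's a rough sketch of what the
> collector agent's config could look like (the agent/component names,
> hostnames, ports, paths, and sizes are placeholder assumptions, not
> recommendations):
>
> collector.sources = r1
> collector.channels = c1
> collector.sinks = k1
>
> # Listen for syslog traffic forwarded from the webservers
> collector.sources.r1.type = syslogtcp
> collector.sources.r1.host = 0.0.0.0
> collector.sources.r1.port = 5140
> collector.sources.r1.channels = c1
>
> # Memory channel (fast but not durable; see the channel discussion below)
> collector.channels.c1.type = memory
> collector.channels.c1.capacity = 100000
> collector.channels.c1.transactionCapacity = 1000
>
> # Write to HDFS, rolling files by time and size
> collector.sinks.k1.type = hdfs
> collector.sinks.k1.channel = c1
> collector.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/weblogs/%Y-%m-%d
> collector.sinks.k1.hdfs.useLocalTimeStamp = true
> collector.sinks.k1.hdfs.fileType = DataStream
> collector.sinks.k1.hdfs.rollInterval = 300
> collector.sinks.k1.hdfs.rollSize = 134217728
> collector.sinks.k1.hdfs.rollCount = 0
>
> For Options 1 and 2 the collector would use an avro source instead (type =
> avro with a bind address and port), but the channel and HDFS sink would look
> the same.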
>
>
> *2) Will all the webservers be acting as sources in our setup?*
>
>
> They will be acting as sources in the general sense that you want to
> ingest their logs.  However, they don't necessarily have to run a Flume
> agent if you have some other way to ship the logs to a listening Flume
> agent somewhere (most likely using syslog, but we've also had success
> receiving logs via the netcat source).
>
>
> *3) Can we sync webserver logs directly to the HDFS store, bypassing
> channels?*
>
>
> Not sure exactly what you mean here, but every Flume flow needs a source, a
> channel, and a sink running (in this case an HDFS sink), so you can't skip
> the channel, and you can't get the logs into HDFS using only a channel
> either.
>
>
> *4) Do we have a choice of syncing the weblogs directly to the HDFS store
> without letting the webserver write locally? What is the best practice?*
>
>
> If, for example, you're using Apache, you could configure it to send the
> logs directly to syslog, which would forward them to a listening Flume
> syslog source on a remote server; that agent would then write the logs to
> HDFS using the HDFS sink over a memory channel.  In this case you avoid
> having the logs written to disk, but if one part of the data flow goes down
> (e.g., the Flume agent crashes) you will lose log data.  You could switch
> to a file channel, which is durable and would help minimize the risk of
> data loss.  If you don't care about potential data loss, then the memory
> channel is much faster and a bit easier to set up.
>
>
> *5) What setup would that be where I let Flume watch a local data directory
> on the webservers and sync it as soon as data arrives in that directory?*
>
>
> You would want to use a file spooler (spooling directory) source to read
> the log directory and then send to a collector using the avro source/sink.
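>
> As a rough sketch of the webserver-side agent for that setup (the
> directory, hostname, and port below are placeholder assumptions):
>
> webagent.sources = r1
> webagent.channels = c1
> webagent.sinks = k1
>
> # Watch the directory where completed (rotated) log files are dropped;
> # the spooling directory source expects files to be immutable once there
> webagent.sources.r1.type = spooldir
> webagent.sources.r1.spoolDir = /var/log/apache2/spool
> webagent.sources.r1.channels = c1
>
> webagent.channels.c1.type = memory
> webagent.channels.c1.capacity = 10000
> webagent.channels.c1.transactionCapacity = 1000
>
> # Forward events to the central collector's avro source
> webagent.sinks.k1.type = avro
> webagent.sinks.k1.channel = c1
> webagent.sinks.k1.hostname = flume-collector.example.com
> webagent.sinks.k1.port = 4141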
>
>
> *6) Do I need a dedicated Flume server for this setup?*
>
>
> It depends on what else the Flume server is doing.  Personally, I think
> it's much simpler if you dedicate a box to the task, as you don't have to
> worry about resource contention and monitoring becomes easier.  In
> addition, if you use the file channel you will want dedicated disks for
> that purpose.  Note that I'm referring to your collector/storage tier.
> Obviously, if you use a Flume agent on the webservers it will not be a
> dedicated box, but this shouldn't be an issue, as that agent is only
> responsible for collecting logs off a single machine and forwarding them on
> (this blog post has some good info on tuning and topology design:
> https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1).
>
>
> *7) If I do use a memory-based channel and then sync to HDFS, do I need a
> dedicated server, or can I run those agents on the webservers themselves,
> provided there is enough memory? Or would it be recommended to put my
> config on a centralized Flume server and establish the sync there?*
>
>
> I would not recommend running Flume agents with an HDFS sink on all the
> webservers.  It seems much better to funnel the logs to one or more agents
> that write to HDFS rather than have all 50 webservers writing to HDFS
> themselves.
>
>
> *8) How should we do capacity planning for a memory-based channel?*
>
>
> You have to decide how long you want to be able to hold data in the memory
> channel in the event a downstream agent goes down (or the HDFS sink gets
> backed up).  Once you have that value, you need to figure out your average
> event size and the rate at which you are collecting events.  This will give
> you a rough idea.  There is some per-event memory overhead as well (but I
> don't know the exact value for that).  If you're using Cloudera Manager,
> you can monitor the memory channel usage directly from the Cloudera Manager
> interface, which is very useful.
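>
> As a very rough worked example using the numbers from your original post
> (the event size and the buffer window are assumptions on my part):
>
> # 200 GB/day across 50 webservers is about 200e9 / 86400 bytes, ~2.3 MB/s.
> # Assuming ~500 bytes per event, that is roughly 4,600 events/s.
> # Riding out a 10 minute outage: 4,600 * 600 is about 2.8 million events,
> # or ~1.4 GB of raw payload, plus per-event overhead and some headroom.
> collector.channels.c1.type = memory
> collector.channels.c1.capacity = 3000000
> collector.channels.c1.byteCapacity = 2000000000
> collector.channels.c1.transactionCapacity = 1000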
>
>
> *9) How should we do capacity planning for a file-based channel?*
>
>
> Assuming you're referring to heap memory, I think I saw in a different
> thread that you need 32 bytes per event you want to store (the channel
> capacity) + whatever Flume core will use. So if your channel capacity is 1
> million events you will need ~32MB of heap space + 100-500MB for Flume
> core.  You will of course need enough disk space to store the actual logs
> themselves.
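>
> A rough file channel sketch with the heap and disk math spelled out (the
> paths and numbers are placeholder assumptions):
>
> # Heap: roughly 32 bytes per event of channel capacity, e.g.
> # 1,000,000 events * 32 bytes is about 32 MB, plus 100-500 MB for Flume
> # core.
> # Disk: the dataDirs need enough space to hold the queued events
> # themselves, ideally on dedicated disks separate from the checkpoint dir.
> collector.channels.c1.type = file
> collector.channels.c1.checkpointDir = /flume/file-channel/checkpoint
> collector.channels.c1.dataDirs = /flume/file-channel/data
> collector.channels.c1.capacity = 1000000
> collector.channels.c1.transactionCapacity = 1000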
>
>
> Best,
>
>
> Ed
>
>
>
>
>
> On Thu, Feb 6, 2014 at 6:22 AM, Asim Zafir <asim.zafir@gmail.com> wrote:
>
>> Flume Users,
>>
>>
>> Here is the problem statement; I will be very much interested in your
>> valuable input and feedback on the following:
>>
>>
>> *Assuming the fact that we generate 200 GB of logs PER DAY from 50
>> webservers:*
>>
>>
>>
>> The goal is to sync that to an HDFS repository.
>>
>>
>>
>>
>>
>> 1) Do all the webservers in our case need to run a Flume agent?
>>
>> 2) Will all the webservers be acting as sources in our setup?
>>
>> 3) Can we sync webserver logs directly to the HDFS store, bypassing channels?
>>
>> 4) Do we have a choice of syncing the weblogs directly to the HDFS store
>> without letting the webserver write locally? What is the best practice?
>>
>> 5) What setup would that be where I let Flume watch a local data directory
>> on the webservers and sync it as soon as data arrives in that directory?
>>
>> 6) Do I need a dedicated Flume server for this setup?
>>
>> 7) If I do use a memory-based channel and then sync to HDFS, do I need a
>> dedicated server, or can I run those agents on the webservers themselves,
>> provided there is enough memory? Or would it be recommended to put my
>> config on a centralized Flume server and establish the sync there?
>>
>> 8) How should we do capacity planning for a memory-based channel?
>>
>> 9) How should we do capacity planning for a file-based channel?
>>
>>
>>
>> sincerely,
>>
>> AZ
>>
>
>
