flume-user mailing list archives

From Paul Chavez <pcha...@verticalsearchworks.com>
Subject RE: distributed weblogs ingestion on HDFS via flume
Date Thu, 06 Feb 2014 03:08:40 GMT
Hi Asim,

I have a similar use case that has been in production for about a year. We have 6 web servers
sending about 15GB a day of web server logs to an 11 node Hadoop cluster. Additionally those
same hosts send another few GB of data from the web applications themselves to HDFS using
flume. I would say in total we send about 120GB a day to HDFS from those 6 hosts. The web
servers each run a local flume agent with an identical config.  I will just describe our configuration
as I think it answers most of your questions.

The web logs are IIS log files that roll once a day and are written to continuously. We wanted
a relatively low latency from log file to HDFS so we use a scheduled task to create an incremental
'diff' file every minute and place it in a spool directory. From there a spoolDir source on
the local agent processes the file. The same task that creates the diff files also cleans
up ones that have been processed by flume. This spoolDir source is just for weblogs and has
no special routing, it uses a file channel and Avro sinks to connect to HDFS. There are two
sinks in a failover configuration using a sink group that send to flume agents co-located
on HDFS data nodes.
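A minimal sketch of a web-server agent like the one described above (agent name, paths, hostnames, and ports are all illustrative, not from the original):

```properties
# Hypothetical per-webserver agent: spoolDir source -> file channel -> failover Avro sinks
weblog.sources = spool
weblog.channels = fc
weblog.sinks = avro1 avro2
weblog.sinkgroups = sg1

weblog.sources.spool.type = spooldir
weblog.sources.spool.spoolDir = /var/flume/spool
weblog.sources.spool.channels = fc

weblog.channels.fc.type = file
weblog.channels.fc.checkpointDir = /var/flume/checkpoint
weblog.channels.fc.dataDirs = /var/flume/data

weblog.sinks.avro1.type = avro
weblog.sinks.avro1.channel = fc
weblog.sinks.avro1.hostname = datanode1.example.com
weblog.sinks.avro1.port = 4141

weblog.sinks.avro2.type = avro
weblog.sinks.avro2.channel = fc
weblog.sinks.avro2.hostname = datanode2.example.com
weblog.sinks.avro2.port = 4141

# Failover sink group: avro1 is preferred, avro2 takes over if it fails
weblog.sinkgroups.sg1.sinks = avro1 avro2
weblog.sinkgroups.sg1.processor.type = failover
weblog.sinkgroups.sg1.processor.priority.avro1 = 10
weblog.sinkgroups.sg1.processor.priority.avro2 = 5
```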

The application logs are JSON data that is sent directly from the application to flume via
an httpSource. Again, the logs are sent to the same local agent. We have about a dozen separate
log streams in total from our application but we send all these events to the one httpSource.
Using headers we split into 'high' and 'low' priority streams using a multiplexing channel
selector. We don't really do any special handling of the high priority events, we just watch
one of them a LOT closer than the other. ;)  These channels are also drained by an Avro sink
and are sent to a different pair of flume agents, again co-located with data nodes.
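The high/low split on the app-log agent might look something like this (the header name "priority" and all ports/hosts are assumptions for illustration):

```properties
# Hypothetical app-log agent: HTTP source with a multiplexing channel selector
applog.sources = http
applog.channels = highC lowC
applog.sinks = avroHigh avroLow

applog.sources.http.type = http
applog.sources.http.port = 5140
applog.sources.http.channels = highC lowC

# Route events by header value; unmatched events fall through to the low channel
applog.sources.http.selector.type = multiplexing
applog.sources.http.selector.header = priority
applog.sources.http.selector.mapping.high = highC
applog.sources.http.selector.default = lowC

applog.channels.highC.type = file
applog.channels.lowC.type = file

applog.sinks.avroHigh.type = avro
applog.sinks.avroHigh.channel = highC
applog.sinks.avroHigh.hostname = collector1.example.com
applog.sinks.avroHigh.port = 4141

applog.sinks.avroLow.type = avro
applog.sinks.avroLow.channel = lowC
applog.sinks.avroLow.hostname = collector2.example.com
applog.sinks.avroLow.port = 4141
```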

The flume agents that run on the data nodes just have avro sources, file channels and HDFS
sinks. We have found two HDFS sinks (without any sink groups) can be necessary to keep the
channels from backing up in some cases. The number of file channels on these agents varies:
some log streams are split into their own channels first, and many of them use a 'catch all'
channel that uses tokenized paths on the sink to write the data to different locations in
HDFS according to header values.  The HDFS sinks all write DataStream files with format Text
and bucket into Hive friendly partitioned directories by date and hour. The JSON events are
one line each and we use a custom Hive SerDe for the JSON data. We use Hive external table
definitions to read the data and use Oozie to process every log stream hourly into Snappy
compressed internal Hive tables, then drop the raw data after 8 days. We don't use Impala
all that much as most of our workflows just crunch the data and push it back into a SQL DW
for the data people.
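A collector-tier agent along those lines could be sketched as follows (the `%{logtype}` header and all paths are hypothetical stand-ins for the tokenized-path routing described above):

```properties
# Hypothetical collector agent on a data node: Avro source -> file channel -> two HDFS sinks
collector.sources = avroIn
collector.channels = fc
collector.sinks = hdfs1 hdfs2

collector.sources.avroIn.type = avro
collector.sources.avroIn.bind = 0.0.0.0
collector.sources.avroIn.port = 4141
collector.sources.avroIn.channels = fc

collector.channels.fc.type = file

# Tokenized path: a header value picks the stream, time escapes give
# Hive-friendly date/hour partitions; files are written as plain text
collector.sinks.hdfs1.type = hdfs
collector.sinks.hdfs1.channel = fc
collector.sinks.hdfs1.hdfs.path = /data/raw/%{logtype}/date=%Y-%m-%d/hour=%H
collector.sinks.hdfs1.hdfs.fileType = DataStream
collector.sinks.hdfs1.hdfs.writeFormat = Text
collector.sinks.hdfs1.hdfs.useLocalTimeStamp = true

# Second sink draining the same channel, as described above
collector.sinks.hdfs2.type = hdfs
collector.sinks.hdfs2.channel = fc
collector.sinks.hdfs2.hdfs.path = /data/raw/%{logtype}/date=%Y-%m-%d/hour=%H
collector.sinks.hdfs2.hdfs.fileType = DataStream
collector.sinks.hdfs2.hdfs.writeFormat = Text
collector.sinks.hdfs2.hdfs.useLocalTimeStamp = true
```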

Here is a good start for capacity planning: https://cwiki.apache.org/confluence/display/FLUME/Flume%27s+Memory+Consumption.

We have gotten away with default channel sizes (1 million) so far without issue. We do try
to separate the file channels onto different physical disks as much as we can to optimize throughput.

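For reference, the sizing and disk-separation points above boil down to a couple of file channel properties (directory paths here are illustrative):

```properties
# File channel sized at the default 1,000,000 events, with checkpoint and
# data directories placed on separate physical disks to spread I/O
collector.channels.fc.type = file
collector.channels.fc.capacity = 1000000
collector.channels.fc.checkpointDir = /disk1/flume/checkpoint
collector.channels.fc.dataDirs = /disk2/flume/data,/disk3/flume/data
```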
Hope that helps,
Paul Chavez

From: Asim Zafir [mailto:asim.zafir@gmail.com]
Sent: Wednesday, February 05, 2014 1:23 PM
To: user@flume.apache.org
Subject: distributed weblogs ingestion on HDFS via flume

Flume Users,

Here is the problem statement, will be very much interested to have your valuable input and
feedback on the following:

Assuming the fact that we generate 200GB of logs PER DAY from 50 webservers

Goal is to sync that to HDFS repository

1) do all the webservers in our case need to run a flume agent?
2) will all the webservers be acting as sources in our setup?
3) can we sync webserver logs directly to the HDFS store, bypassing channels?
4) do we have a choice of syncing the weblogs directly to the HDFS store rather than letting the webservers
write locally? what is the best practice?
5) what setup would let flume watch a local data directory on the webservers
and sync it as soon as data arrives in that directory?
6) do i need a dedicated flume server for this setup?
7) if i use a memory-based channel and then sync to HDFS, do i need a dedicated server, or
can i run those agents on the webservers themselves, provided there is enough memory? or would it
be recommended to move my config to a centralized flume server and then establish the sync?
8) how should we do the capacity planning for a memory-based channel?
9) how should we do the capacity planning for a file-based channel?

