flume-user mailing list archives

From Lin Ma <lin...@gmail.com>
Subject Re: beginner's question -- file source configuration
Date Mon, 09 Mar 2015 05:30:39 GMT
Thanks Gwen,

So the purpose of Flume's two-tier architecture is to reduce the number
of processes writing to HDFS? I remember that if too many processes
write to HDFS, the name node will have issues.

regards,
Lin

On Sun, Mar 8, 2015 at 8:26 PM, Gwen Shapira <gshapira@cloudera.com> wrote:

> As stated in the docs, you'll need to have the timestamp in the event
> header for HDFS to automatically place the events in the correct
> directory.
> This can be done using the timestamp interceptor.
>
> You can see an example here:
>
> https://github.com/hadooparchitecturebook/hadoop-arch-book/tree/master/ch09-clickstream/Flume
>
> This example uses a 2-tier architecture (i.e. one Flume agent collecting
> logs from web servers and another writing to HDFS).
> However, you can see how in client.conf the spooling-directory source
> is configured with the timestamp interceptor, and in collector.conf the
> HDFS sink has a parameterized target directory with the timestamp in
> it.
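>
> Roughly, the relevant pieces look like this (the agent, channel, and
> path names below are placeholders, not the ones in the repo):
>
> # client.conf: spooling-directory source with a timestamp interceptor
> client.sources = src1
> client.sources.src1.type = spooldir
> client.sources.src1.spoolDir = /var/log/web/spool
> client.sources.src1.interceptors = ts
> client.sources.src1.interceptors.ts.type = timestamp
> client.sources.src1.channels = ch1
>
> # collector.conf: HDFS sink with a time-escaped target directory
> collector.sinks = hdfs1
> collector.sinks.hdfs1.type = hdfs
> collector.sinks.hdfs1.hdfs.path = /flume/clickstream/%Y/%m/%d/%H
> collector.sinks.hdfs1.channel = ch1
>
> The %Y/%m/%d/%H escapes are filled in from the timestamp header on each
> event, so new hourly directories get created automatically as time
> moves on.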
>
> Gwen
>
> On Sun, Mar 8, 2015 at 7:56 PM, Lin Ma <linlma@gmail.com> wrote:
> > Thanks Ashish,
> >
> > One further question on the HDFS sink. If I configure the destination
> > directory on HDFS with a Year/Month/Day/Hour pattern, will Flume
> > automatically put each event it receives into the matching directory
> > and create new directories as time passes? Or do I have to set some
> > key/value headers on each event so the HDFS sink can recognize the
> > event time and put it into the appropriate time-based folder?
> >
> > regards,
> > Lin
> >
> > On Sun, Mar 8, 2015 at 6:32 PM, Ashish <paliwalashish@gmail.com> wrote:
> >>
> >> Your understanding is correct :)
> >>
> >> On Mon, Mar 9, 2015 at 6:54 AM, Lin Ma <linlma@gmail.com> wrote:
> >> > Thanks Ashish,
> >> >
> >> > I followed your guidance and found the instructions below, about
> >> > which I have a further question. It seems we need to close the
> >> > files and never touch them again for Flume to process them
> >> > correctly, so I am not sure if this is a good practice: (1) let the
> >> > application write log files the existing way, e.g. in an hourly or
> >> > 5-minute rotation pattern, and (2) close and move the finished
> >> > files to another directory that serves as the input for a Flume
> >> > agent with a Spooling Directory source?
> >> >
> >> > "This source will watch the specified directory for new files, and
> >> > will parse events out of new files as they appear."
> >> >
> >> > "If a file is written to after being placed into the spooling
> >> > directory, Flume will print an error to its log file and stop
> >> > processing. If a file name is reused at a later time, Flume will
> >> > print an error to its log file and stop processing."
> >> >
> >> > regards,
> >> > Lin
> >> >
> >> > On Sun, Mar 8, 2015 at 12:23 AM, Ashish <paliwalashish@gmail.com> wrote:
> >> >>
> >> >> Please look at the following:
> >> >> Spooling Directory Source
> >> >> (http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source)
> >> >> and
> >> >> HDFS Sink (http://flume.apache.org/FlumeUserGuide.html#hdfs-sink)
> >> >>
> >> >> The Spooling Directory Source needs immutable files, meaning files
> >> >> must not be written to once they are being consumed. In short, your
> >> >> application cannot write to a file that is being read by Flume.
> >> >>
> >> >> The log format is not an issue as long as you don't need it to be
> >> >> interpreted by Flume components. Since it's a log, I'm assuming one
> >> >> entry per line, with a line separator at the end of each line.
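> >> >>
> >> >> A minimal sketch of such a source (agent, channel, and directory
> >> >> names are placeholders):
> >> >>
> >> >> agent1.sources = spool1
> >> >> agent1.channels = ch1
> >> >> agent1.sources.spool1.type = spooldir
> >> >> # files must be fully written and closed before being moved here
> >> >> agent1.sources.spool1.spoolDir = /var/log/app/flume-spool
> >> >> # consumed files get renamed with this suffix (this is the default)
> >> >> agent1.sources.spool1.fileSuffix = .COMPLETED
> >> >> agent1.sources.spool1.channels = ch1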
> >> >>
> >> >> You can also look at the Exec source
> >> >> (http://flume.apache.org/FlumeUserGuide.html#exec-source) for
> >> >> tailing a file that the application is still writing to. The
> >> >> documentation at the links above covers the details.
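> >> >>
> >> >> A sketch of an exec source (again, the names and paths are
> >> >> placeholders):
> >> >>
> >> >> agent1.sources = tail1
> >> >> agent1.sources.tail1.type = exec
> >> >> # re-tail the file across log rotations
> >> >> agent1.sources.tail1.command = tail -F /var/log/app/app.log
> >> >> agent1.sources.tail1.channels = ch1
> >> >>
> >> >> Keep in mind the exec source gives no delivery guarantees: if the
> >> >> agent or the tail process dies, events can be lost, which is why
> >> >> the spooling-directory approach is more reliable.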
> >> >>
> >> >> HTH !
> >> >>
> >> >>
> >> >> On Sun, Mar 8, 2015 at 12:32 PM, Lin Ma <linlma@gmail.com> wrote:
> >> >> > Hi Flume masters,
> >> >> >
> >> >> > I want to install Flume on a box, consume a local log file as
> >> >> > the source, and send it to a remote HDFS sink. The log format is
> >> >> > proprietary plain text (not Avro or JSON).
> >> >> >
> >> >> > I am reading the Flume guide with its many advanced source
> >> >> > configurations, and I am wondering: are there any reference
> >> >> > samples for a plain local log file source? Also, I am not sure
> >> >> > whether Flume can consume a local file while the application is
> >> >> > still writing to it? Thanks.
> >> >> >
> >> >> > regards,
> >> >> > Lin
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> thanks
> >> >> ashish
> >> >>
> >> >> Blog: http://www.ashishpaliwal.com/blog
> >> >> My Photo Galleries: http://www.pbase.com/ashishpaliwal
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> thanks
> >> ashish
> >>
> >> Blog: http://www.ashishpaliwal.com/blog
> >> My Photo Galleries: http://www.pbase.com/ashishpaliwal
> >
> >
>
