Subject: Re: beginner's question -- file source configuration
From: Lin Ma <linlma@gmail.com>
To: user@flume.apache.org
Date: Sun, 8 Mar 2015 23:12:57 -0700

Thanks Gwen,

For your comments "=C2=A0if= one collector is down, the client can connect to=C2=A0another" in #3, how it related to the two-t= ier architecture? And client and collector in this case means?
regards,
Lin
On Sun, Mar 8, 2015 at 10:42 PM, Gwen Shapira <gshapira@cloudera.com> wrote:
There are several benefits to the two-tier architecture:

1. Limit the number of processes writing to HDFS. As you correctly
mentioned, there are some limitations there.
2. Enable us to create larger files faster. (We want to switch files
on HDFS fast to allow querying new data faster, but we also don't want
a gazillion small files.)
3. Two-tier architecture can support high availability and load
balancing - if one collector is down, the client can connect to
another.
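
For example, the client tier can spread events across two collectors
with a load-balancing sink group - a minimal sketch with channel setup
omitted, assuming Avro sinks named k1/k2 and made-up collector hosts:

# both sinks join one group with a load_balance processor
client.sinkgroups = g1
client.sinkgroups.g1.sinks = k1 k2
client.sinkgroups.g1.processor.type = load_balance
# back off from a failed sink so traffic shifts to the other one
client.sinkgroups.g1.processor.backoff = true
client.sinks.k1.type = avro
client.sinks.k1.hostname = collector1.example.com
client.sinks.k1.port = 4141
client.sinks.k2.type = avro
client.sinks.k2.hostname = collector2.example.com
client.sinks.k2.port = 4141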

Gwen

On Sun, Mar 8, 2015 at 10:30 PM, Lin Ma <linlma@gmail.com> wrote:
> Thanks Gwen,
>
> Is using the two-tier architecture of Flume for the purpose of reducing
> the number of processes writing to HDFS? I remember that if too many
> processes write to HDFS, the NameNode will have issues.
>
> regards,
> Lin
>
> On Sun, Mar 8, 2015 at 8:26 PM, Gwen Shapira <gshapira@cloudera.com> wrote:
>>
>> As stated in the docs, you'll need to have the timestamp in the event
>> header for HDFS to automatically place the events in the correct
>> directory.
>> This can be done using the timestamp interceptor.
>>
>> You can see an example here:
>>
>> https://github.com/hadooparchitecturebook/hadoop-arch-book/tree/master/ch09-clickstream/Flume
>>
>> This example uses a 2-tier architecture (i.e. one Flume agent collecting
>> logs from web servers and the other writing to HDFS).
>> However, you can see how in client.conf the spooling-directory source
>> is configured with the timestamp interceptor, and in collector.conf the
>> HDFS sink has a parameterized target directory with the timestamp in
>> it.
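>>
>> In spirit the two configs look something like this (a rough sketch,
>> not the exact files from the repo; channel definitions omitted, and
>> the spool directory and HDFS path are made up):
>>
>> # client.conf - spooling-directory source adds a timestamp header
>> client.sources = r1
>> client.sources.r1.type = spooldir
>> client.sources.r1.spoolDir = /var/log/app/spool
>> client.sources.r1.interceptors = ts
>> client.sources.r1.interceptors.ts.type = timestamp
>> client.sources.r1.channels = c1
>>
>> # collector.conf - HDFS sink expands the escapes using that header
>> collector.sinks = k1
>> collector.sinks.k1.type = hdfs
>> collector.sinks.k1.hdfs.path = /flume/events/%Y/%m/%d/%H
>> collector.sinks.k1.channel = c1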
>>
>> Gwen
>>
>> On Sun, Mar 8, 2015 at 7:56 PM, Lin Ma <linlma@gmail.com> wrote:
>> > Thanks Ashish,
>> >
>> > One further question on the HDFS sink. If I configure the destination
>> > directory on HDFS to be a Year/Month/Day/Hour etc. pattern, will Flume
>> > automatically put the data events it receives into the related
>> > directory, and create new directories as time elapses? Or do I have to
>> > set up some Key/Value headers in the event in order for the HDFS sink
>> > to recognize the event time and put it into the appropriate time-based
>> > folder?
>> >
>> > regards,
>> > Lin
>> >
>> > On Sun, Mar 8, 2015 at 6:32 PM, Ashish <paliwalashish@gmail.com> wrote:
>> >>
>> >> Your understanding is correct :)
>> >>
>> >> On Mon, Mar 9, 2015 at 6:54 AM, Lin Ma <linlma@gmail.com> wrote:
>> >> > Thanks Ashish,
>> >> >
>> >> > I followed your guidance and found the instructions below, about
>> >> > which I have further questions to confirm with you. It seems we
>> >> > need to close the files and never touch them again for Flume to
>> >> > process them correctly, so I am not sure whether it is good
>> >> > practice to -- (1) let the application write log files in the
>> >> > existing way, like an hourly or 5-minute pattern, and (2) close
>> >> > and move the files to another directory as the input Source for
>> >> > the Flume Agent, which Flume could process as a Spooling
>> >> > Directory?
>> >> >
>> >> > "This source will watch the specified directory for new files,
>> >> > and will parse events out of new files as they appear."
>> >> >
>> >> > "
>> >> >
>> >> > If a file is written to after being placed into the = spooling
>> >> > directory,
>> >> > Flume will print an error to its log file and stop p= rocessing.
>> >> > If a file name is reused at a later time, Flume will= print an error
>> >> > to
>> >> > its
>> >> > log file and stop processing.
>> >> >
>> >> > "
>> >> >
>> >> > regards,
>> >> > Lin
>> >> >
>> >> > On Sun, Mar 8, 2015 at 12:23 AM, Ashish <paliwalashish@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Please look at the following:
>> >> >> Spooling Directory Source
>> >> >>
>> >> >> [http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source]
>> >> >> and
>> >> >> HDFS Sink (http://flume.apache.org/FlumeUserGuide.html#hdfs-sink)
>> >> >>
>> >> >> The Spooling Directory Source needs immutable files, meaning
>> >> >> files should not be written to once they are being consumed. In
>> >> >> short, your application cannot write to the file being read by
>> >> >> Flume.
>> >> >> Log format is not an issue, as long as you don't want it to be
>> >> >> interpreted by Flume components. Since it's a log, I'm assuming
>> >> >> a single log entry per line, with a line separator at the end of
>> >> >> each line.
>> >> >>
>> >> >> You can also look at the Exec source
>> >> >> (http://flume.apache.org/FlumeUserGuide.html#exec-source) for
>> >> >> tailing a file being written to by the application. The
>> >> >> documentation covers details on all the links.
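>> >> >>
>> >> >> A minimal Exec source sketch (agent name, channel, and log path
>> >> >> are placeholders):
>> >> >>
>> >> >> a1.sources = r1
>> >> >> a1.sources.r1.type = exec
>> >> >> # tail -F keeps following the file across rotations
>> >> >> a1.sources.r1.command = tail -F /var/log/app/app.log
>> >> >> a1.sources.r1.channels = c1
>> >> >>
>> >> >> Note the docs warn that the Exec source gives no delivery
>> >> >> guarantees if the agent or the tail process dies.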
>> >> >>
>> >> >> HTH !
>> >> >>
>> >> >>
>> >> >> On Sun, Mar 8, 2015 at 12:32 PM, Lin Ma <linlma@gmail.com> wrote:
>> >> >> > Hi Flume masters,
>> >> >> >
>> >> >> > I want to install Flume on a box, consume a local log file
>> >> >> > as the source, and send it to a remote HDFS sink. The log
>> >> >> > format is private and text (not Avro or JSON).
>> >> >> >
>> >> >> > I am reading the Flume guide and many advanced Source
>> >> >> > configurations, and wondering: for a plain local log file
>> >> >> > source, are there any reference samples? Also, I am not sure
>> >> >> > whether Flume can consume a local file while the application
>> >> >> > is still writing to it? Thanks.
>> >> >> >
>> >> >> > regards,
>> >> >> > Lin
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> thanks
>> >> >> ashish
>> >> >>
>> >> >> Blog: http://www.ashishpaliwal.com/blog
>> >> >> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> thanks
>> >> ashish
>> >>
>> >> Blog: http://www.ashishpaliwal.com/blog
>> >> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>> >
>> >
>
>
