Subject: Re: hdfs.idleTimeout, what's it used for?
From: Connor Woodson <cwoodson.dev@gmail.com>
To: user@flume.apache.org
Date: Thu, 17 Jan 2013 18:20:56 -0800

@Mohit:

For the HDFS sink, the .tmp files are placed according to the hadoop.tmp.dir property. The default location is /tmp/hadoop-${user.name}. To change this you can add -Dhadoop.tmp.dir=<path> to your Flume command line, or you can set the property in the core-site.xml of wherever your HADOOP_HOME environment variable points.

- Connor
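For illustration, a sketch of the two options Connor describes. The agent name, config file, and target path below are hypothetical examples, and this assumes the flume-ng script forwards -D options to the agent JVM (as it does for properties such as flume.root.logger):

    # Option 1: override hadoop.tmp.dir on the Flume command line
    flume-ng agent --conf conf --conf-file agent.conf --name a1 \
        -Dhadoop.tmp.dir=/data/flume-tmp

    # Option 2: set it once in $HADOOP_HOME/conf/core-site.xml instead:
    #   <property>
    #     <name>hadoop.tmp.dir</name>
    #     <value>/data/flume-tmp</value>
    #   </property>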
On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson <cwoodson.dev@gmail.com> wrote:

> Whether idleTimeout is lower or higher than rollInterval is a preference. Set it lower, and if you get one message right on the turn of the hour, you will have some part of that hour without any bucket writers; but if you get another message at the end of the hour, you will end up with two files instead of one. Set idleTimeout to be longer and you will get just one file, but also (in the worst case) you will have twice as many BucketWriters open; so it all depends on how many files you want and how much memory you have to spare.
>
> - Connor
>
> An aside: BucketWriters, after being closed by rollInterval, aren't really a memory leak; they are just very rarely useful to keep around (your path could rely on hostname, and you could use a rollInterval, and then those BucketWriters would still remain useful). And they will get removed eventually; by default, after you've created your 5001st BucketWriter, the first (or whichever was used longest ago) will be removed.
>
> And I don't think that's the cause behind FLUME-1850, as he did have an idleTimeout set at 15 minutes.
>
> On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <juhani_connolly@cyberagent.co.jp> wrote:
>
>> It's also useful if you want files to get promptly closed and renamed from the .tmp suffix or whatever.
>>
>> We use it with something like a 30-second setting (we have a constant stream of data) and hourly bucketing.
>>
>> There is also the issue that files closed by rollInterval are never removed from the internal LinkedList, so it actually causes a small memory leak (which can get big in the long term if you have a lot of files and hourly renames). I believe this is what is causing the OOM Mohit is getting in FLUME-1850.
>>
>> So I personally would recommend using it (with a setting that will close files before rollInterval does).
>>
>> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>>
>>> Ah, I see. Again, something useful to have in the Flume user guide.
>>>
>>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <cwoodson.dev@gmail.com> wrote:
>>>
>>>> The rollInterval will still cause the last 01-17 file to be closed eventually. The way the HDFS sink works with the different files is that each unique path is handled by a different BucketWriter object. The sink can hold as many of these objects as specified by hdfs.maxOpenFiles (default: 5000), and BucketWriters are only removed when you create the 5001st writer (5001st unique path). However, generally once a writer is closed it is never used again (all of your 01-17 writers will never be used again). To avoid keeping them in the sink's internal list of writers, the idleTimeout is a specified number of seconds in which no data is received by the BucketWriter. After this time, the writer will try to close itself and will then tell the sink to remove it, thus freeing up everything used by the BucketWriter.
>>>>
>>>> So the idleTimeout is just a setting to help limit memory usage by the HDFS sink. The ideal time for it is longer than the maximum time between events (capped at the rollInterval) - if you know you'll receive a constant stream of events, you might just set it to a minute or so. Or if you are fine with having multiple files open per hour, you can set it to a lower number, maybe just over the average time between events. For me, in testing, I set it >= rollInterval for the cases when no events are received in a given hour (I'd rather keep the object alive for an extra hour than create files every 30 minutes or so).
>>>>
>>>> Hope that was helpful,
>>>>
>>>> - Connor
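To make the tradeoff above concrete, a minimal sketch of the kind of sink configuration being discussed, following Juhani's recommendation that idleTimeout close files before rollInterval does. The agent and sink names and the values are illustrative, not defaults:

    # hypothetical agent "a1", sink "k1": hourly buckets, hourly roll
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H
    a1.sinks.k1.hdfs.rollInterval = 3600
    # close (and rename from .tmp) any bucket idle for 60 seconds,
    # which also evicts its BucketWriter from the sink's internal list
    a1.sinks.k1.hdfs.idleTimeout = 60
    # upper bound on cached BucketWriters (5000 is the default)
    a1.sinks.k1.hdfs.maxOpenFiles = 5000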
>>>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar <bhaskarvk@gmail.com> wrote:
>>>>
>>>>> Say I have
>>>>>
>>>>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>>>> hdfs.rollInterval=60
>>>>>
>>>>> Now, suppose there is a file /flume/events/2013-01-17/flume_XXXXXXXXX.tmp. This file is not ready to be rolled over yet, i.e. 60 seconds are not up, and now it's past 12 midnight, i.e. a new day, and events start to be written to /flume/events/2013-01-18/flume_XXXXXXXX.tmp.
>>>>>
>>>>> Will the 2013-01-17 file never be rolled over unless I have something like hdfs.idleTimeout=60? If so, how do Flume sinks keep track of the files they need to roll over after idleTimeout?
>>>>>
>>>>> In short, what's the exact use of the idleTimeout parameter?
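Tying the answers back to this question: per Connor, rollInterval alone will still close the 2013-01-17 file eventually, but only idleTimeout promptly renames the .tmp file and evicts the idle BucketWriter from the sink's cache. A rough sketch of the implied fix, with illustrative values (note that idleTimeout, like the other HDFS sink properties, takes the full sink prefix):

    a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
    a1.sinks.k1.hdfs.rollInterval = 60
    # close the quiet 01-17 bucket roughly 60s after its last event
    a1.sinks.k1.hdfs.idleTimeout = 60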