From: Brock Noland
Date: Tue, 18 Sep 2012 09:37:01 -0500
Subject: Re: HDFS file rolling behaviour
To: user@flume.apache.org

If you have not increased the OS limit on open files, you should. The default limit of 1024 is too low for nearly every modern application.

As for the rolling, can you paste your config and describe in more detail the unexpected behavior you are seeing?

Brock

On Tue, Sep 18, 2012 at 7:08 AM, Jagadish Bihani <jagadish.bihani@pubmatic.com> wrote:
> Hi
>
> Does anybody know about the issue mentioned in the following mail?
>
> Update: I have now seen the following behaviour even for time-based rolling.
> With time-based rolling I would expect a single file to be created after
> x seconds, but in my case some n files are created every x seconds.
> Is it something to do with the HDFS batch size?
>
> Regards,
> Jagadish
>
> -------- Original Message --------
> Subject: HDFS file rolling behaviour
> Date: Thu, 13 Sep 2012 14:26:56 +0530
> From: Jagadish Bihani <jagadish.bihani@pubmatic.com>
> To: user@flume.apache.org
>
> Hi
>
> I use two Flume agents:
> 1. flume_agent 1, the source side (exec source - file channel - avro sink)
> 2. flume_agent 2, the destination side (avro source - file channel - HDFS sink)
>
> I have observed that when the HDFS sink rolls by *file size/number of
> events* it creates a lot of simultaneous connections to the source's avro sink.
> But while rolling by *time interval* it does it one by one, i.e. it opens
> one HDFS file, writes to it, and then closes it. I would expect the same
> for the other rolling criteria: open a file, and once x events have been
> written to it, roll it and open another, and so on.
>
> In my case data ingestion works fine with time-based rolling, but in the
> other cases the behaviour above gives me exceptions like:
> -- too many open files
> -- timeout-related exceptions for the file channel, and a few more.
>
> I can increase the values of the parameters causing the exceptions, but I
> don't know what adverse effects that may have.
>
> Can somebody throw some light on rolling based on file size/number of events?
>
> Regards,
> Jagadish

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
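[For readers hitting the same "too many open files" error, a sketch of checking and raising the limit. The `flume` user name and the `limits.conf` path are assumptions; the exact mechanism varies by distribution.]

```shell
# Inspect the per-process open-file limits in the current shell.
# The soft limit is what a process actually hits; the hard limit is
# the ceiling a non-root user may raise the soft limit to.
ulimit -Sn   # soft limit (often 1024 by default)
ulimit -Hn   # hard limit

# Raise the soft limit for this session only (must not exceed the hard limit):
# ulimit -n 65536

# To make it permanent for the user running Flume, the usual place is
# /etc/security/limits.conf (distribution-specific), e.g.:
#   flume  soft  nofile  65536
#   flume  hard  nofile  65536
```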
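[For context, a hypothetical destination-agent HDFS sink configuration contrasting the rolling triggers discussed in the thread. The agent/sink names and all values are illustrative, not taken from the original messages. Note that a roll* property set to 0 disables that trigger; if several are non-zero, whichever fires first rolls the file.]

```properties
# Hypothetical HDFS sink on the destination agent (names are placeholders)
agent2.sinks.hdfsSink.type = hdfs
agent2.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d

# Time-based rolling only: one new file every 300 seconds
agent2.sinks.hdfsSink.hdfs.rollInterval = 300
agent2.sinks.hdfsSink.hdfs.rollSize = 0
agent2.sinks.hdfsSink.hdfs.rollCount = 0

# For size/count-based rolling instead, zero the interval and set e.g.:
#   agent2.sinks.hdfsSink.hdfs.rollInterval = 0
#   agent2.sinks.hdfsSink.hdfs.rollSize = 134217728   # bytes
#   agent2.sinks.hdfsSink.hdfs.rollCount = 100000     # events

# Events written to HDFS per flush/transaction
agent2.sinks.hdfsSink.hdfs.batchSize = 1000
```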