Subject: Re: Enabling file channel backup checkpoint causes significant disk IO at start-up
From: Gary Malouf
To: user
Date: Mon, 8 Sep 2014 16:59:59 -0400

Hi Hari,

I'm a colleague of Michael's. If we need a few of these patches, would
you recommend that we do our own custom build?

Separate from Apache's release cycle, would these patches be included in
the next CDH build that includes Flume? (I'm not sure what the schedule
for that is.)

Thanks,

Gary

On Mon, Sep 8, 2014 at 4:55 PM, Hari Shreedharan
<hshreedharan@cloudera.com> wrote:

> Flume releases happen once every few months. Since we just had one a
> couple of months back, I don't think there will be another right away.
>
> Michael Diamant wrote:
>
> > Hari, thank you for your quick reply. A follow-up question to help
> > me figure out how best to proceed on my end: can you provide an
> > estimate as to when the next Flume release will occur?
> >
> > On Mon, Sep 8, 2014 at 4:07 PM, Hari Shreedharan
> > <hshreedharan@cloudera.com> wrote:
> >
> > > This patch should address the issue, if enabled:
> > > https://git-wip-us.apache.org/repos/asf?p=flume.git;a=commitdiff;h=69fd6b3ad5e5b9ae6f1293b3d8e57ed57fd6701c;hp=f15f20785262ac3cb3e35c2a12e669b7a836d35f
> > > It will be part of the next Flume release (or CDH5.2.0).
> > >
> > > --
> > >
> > > Thanks,
> > > Hari
> > >
> > > Michael Diamant <diamant.michael@gmail.com>
> > > September 8, 2014 at 12:58 PM
> > >
> > > My team uses Flume 1.4.0 packaged with CDH5.0.2 via an embedded
> > > agent to write to a file channel. In a previous thread started by
> > > my colleague, "FileChannel Replays consistently take a long time",
> > > and the associated issue,
> > > https://issues.apache.org/jira/browse/FLUME-2450, it was suggested
> > > that we use a backup checkpoint directory to avoid lengthy
> > > replays. When I enabled the backup checkpoint directory, I
> > > observed via iotop that my application with the embedded agent
> > > used nearly 100% of IO. This level of IO persists for about 30
> > > seconds, rendering the application unusable during that period.
> > >
> > > For comparison, I monitored via iotop with the backup checkpoint
> > > disabled. IO activity occurs for at most several seconds. That
> > > is, there is a qualitative difference when the backup checkpoint
> > > directory is enabled. I also tried deleting the existing
> > > checkpoint and data directories to start with a clean slate. The
> > > results of those experiments are in line with my observations
> > > above.
> > >
> > > Is this expected behavior when using a backup checkpoint
> > > directory? Is there any way in which the amount of IO can be
> > > reduced? I appreciate feedback and insights because the current
> > > behavior is untenable for a production environment.
> > >
> > > Thank you,
> > > Michael
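
For readers following the thread, the setup Michael describes (an embedded agent writing to a file channel with a backup checkpoint directory) corresponds roughly to the properties below. This is a minimal sketch, not the poster's actual configuration: the directory paths are hypothetical, and the key names assume the embedded agent's `channel.` prefix together with the file channel's documented `useDualCheckpoints` and `backupCheckpointDir` settings.

```properties
# Sketch only: all paths are hypothetical placeholders.
channel.type = file
channel.checkpointDir = /var/lib/myapp/flume/checkpoint
channel.dataDirs = /var/lib/myapp/flume/data

# Enable the backup checkpoint discussed in this thread.
# The backup directory must differ from checkpointDir and dataDirs.
channel.useDualCheckpoints = true
channel.backupCheckpointDir = /var/lib/myapp/flume/checkpoint-backup
```

In a standalone agent configuration file the same settings would carry the usual `<agent>.channels.<channel>.` prefix instead of `channel.`.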