flume-user mailing list archives

From Hari Shreedharan <hshreedha...@apache.org>
Subject Re: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"
Date Fri, 15 Aug 2014 16:27:31 GMT
What version of Flume are you using?


On Tue, Aug 12, 2014 at 1:51 PM, Mangtani, Kushal <
Kushal.Mangtani@viasat.com> wrote:

>  Bumping this up to make sure someone answers this.
>
> P.S.: let me know if I need to post these questions on a separate thread.
>
>  Thanks,
> Kushal Mangtani
>
>  ------------------------------
> *From:* Mangtani, Kushal
> *Sent:* Friday, August 08, 2014 12:39 PM
> *To:* user@flume.apache.org
> *Subject:* RE: File Channel Exception "Failed to obtain lock for writing
> to the log.Try increasing the log write timeout value"
>
>   Hello FlumeTeam,
>
>  I have recently seen a bug/weird behaviour in the File Channel. I am
> using the FileChannel in my prod env, so please save me from hiccups in
> prod. Recently my file channel became full.
> So the only ways of fixing this were:
>
>    1. restart the flume process.
>    2. tweak the transactionCapacity of the fileChannel.
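>
> As a sketch of option 2 (with hypothetical agent/channel names a1/c1; the
> property names are the standard file channel ones):
>
> ```properties
> a1.channels.c1.type = file
> # Maximum number of events the channel may hold on disk
> a1.channels.c1.capacity = 1000000
> # Maximum events per transaction between a source/sink and the channel
> a1.channels.c1.transactionCapacity = 10000
> ```
>
> Note that the "channel full" condition is governed by capacity rather than
> transactionCapacity, so raising only the latter may not help.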
>
> I went with 1). However, after doing so, my flume process was stuck and
> the logs were:
>
>  08 Aug 2014 19:03:54,014 INFO  [lifecycleSupervisor-1-4]
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:597)  - File
> position exceeds the threshold: 1623195647, position: 1623195649
>
> 08 Aug 2014 19:03:54,015 INFO  [lifecycleSupervisor-1-4]
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:608)  -
> Encountered EOF at 1623195649 in
> /usr/lib/flume-ng/datastore/channel1/logs/log-5802
>
>
>  It looks like, for some reason, the file pointer was at a position
> greater than the file size. Ultimately, I had to delete the logs,
> checkpoint, and backup-checkpoint for my flume process to process events.
>
> So the whole purpose of the FileChannel, i.e. better durability in
> exchange for average performance, was defeated here.
>
>
>  Questions:
>
>
>    1. Is there something I could have done to prevent this data loss?
>    2. Also, I believe Flume-ng is a push-pull mechanism, where the source
>    pushes events to channels and sinks pull events from channels, which is
>    contradictory to flume-og (a push-only mechanism). Correct me if I'm
>    wrong. Was there a reason for this push-pull architecture in flume-land?
>
> Thanks,
> Kushal Mangtani
>
>  ------------------------------
> *From:* Hari Shreedharan [hshreedharan@cloudera.com]
> *Sent:* Friday, February 28, 2014 11:38 AM
> *To:* user@flume.apache.org
> *Subject:* Re: File Channel Exception "Failed to obtain lock for writing
> to the log.Try increasing the log write timeout value"
>
>   It is currently in trunk, so it will be in flume 1.5
>
>
> Thanks,
> Hari
>
>  On Friday, February 28, 2014 at 11:30 AM, Mangtani, Kushal wrote:
>
>   Hari,
>
>
>
> Thanks for the feedback. This was really helpful. I am going to use
> provisioned IO for a while to make sure the exception does not come back.
>
>
>
> Also, from the comments section of the Jira ticket given below, I noticed
> that you were able to identify the cause of the exception, perhaps that
> old logs are never deleted. Are you going to put a patch into Flume 1.5 so
> that this exception is resolved?
>
>
>
> -Kushal Mangtani
>
>
>
> *From:* Hari Shreedharan [mailto:hshreedharan@cloudera.com
> <hshreedharan@cloudera.com>]
> *Sent:* Thursday, February 27, 2014 11:19 AM
> *To:* user@flume.apache.org
> *Subject:* Re: File Channel Exception "Failed to obtain lock for writing
> to the log.Try increasing the log write timeout value"
>
>
>
> See https://issues.apache.org/jira/browse/FLUME-2307
>
>
>
> This jira removed the write-timeout, but that only makes sure that no
> transaction is left in limbo. The real reason, as I said, is slow IO. Try
> using provisioned IO for better throughput.
>
>
>
>
>
> Thanks,
>
> Hari
>
>
>
> On Thursday, February 27, 2014 at 10:48 AM, Mangtani, Kushal wrote:
>
>   Hari,
>
>
>
> Thanks for the prompt reply. The current file channel’s write-timeout is
> 30 sec, and the EBS drive’s current capacity is 200 GB. The rate of writes
> is 60 events/min, where each event is approx. 40 KB.
>
>
>
>  I am thinking of increasing the file channel write-timeout to 60 sec.
> What do you suggest?
>
> Also, one strange thing I noticed: all the flume collectors get the same
> exception, even though each has a separate EBS drive. Any inputs?
>
>
>
> Thanks,
>
> Kushal Mangtani
>
>
>
> *From:* Hari Shreedharan [mailto:hshreedharan@cloudera.com
> <hshreedharan@cloudera.com>]
> *Sent:* Thursday, February 27, 2014 10:35 AM
> *To:* user@flume.apache.org
> *Subject:* Re: File Channel Exception "Failed to obtain lock for writing
> to the log.Try increasing the log write timeout value"
>
>
>
> For now, increase the file channel’s write-timeout parameter to around 30
> or so (basically the file channel is timing out while writing to disk).
> But the basic problem you are seeing is that your EBS instance is very
> slow and IO is taking too long. You either need to increase your EBS IO
> capacity, or reduce the rate of writes.
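>
> The setting being discussed is the file channel's write-timeout property
> (in seconds). A sketch, assuming a hypothetical agent named a1 and the
> channel c2 from the stack trace:
>
> ```properties
> a1.channels.c2.type = file
> # Seconds to wait for the shared log write lock before the
> # transaction fails with the "Failed to obtain lock" error
> a1.channels.c2.write-timeout = 30
> ```
>
> (FLUME-2307 later removed this timeout entirely, so the property only
> applies to 1.4 and earlier releases.)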
>
>
>
>
>
> Thanks,
>
> Hari
>
>
>
> On Thursday, February 27, 2014 at 10:28 AM, Mangtani, Kushal wrote:
>
>
>
>
>
> *From:* Mangtani, Kushal
> *Sent:* Wednesday, February 26, 2014 4:51 PM
> *To:* 'user@flume.apache.org'; 'user-subscribe@flume.apache.org'
> *Cc:* Rangnekar, Rohit; 'dev@flume.apache.org'
> *Subject:* File Channel Exception "Failed to obtain lock for writing to
> the log.Try increasing the log write timeout value"
>
>
>
> Hi,
>
>
>
> I'm using the Flume-Ng 1.4 cdh4.4 tarball for collecting aggregated logs.
>
> I am running a 2-tier (agent, collector) Flume configuration with custom
> plugins. There are approximately 20 agent (receiving data) and 6 collector
> (writing to HDFS) flume machines, all running independently. However, I
> have been facing some File Channel exceptions on the collector side. The
> agents appear to be working fine.
>
>
>
>  Error stacktrace:
>
> org.apache.flume.ChannelException: Failed to obtain lock for writing to
> the log. Try increasing the log write timeout value. [channel=c2]
>         at org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doRollback(FileChannel.java:621)
>         at org.apache.flume.channel.BasicTransactionSemantics.rollback(BasicTransactionSemantics.java:168)
>         at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:421)
>         at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
>         at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>         …
>
> And I keep on getting the same error.
>
>
>
> P.S.: This same exception is repeated on most of the flume collector
> machines, but not at the same time; there is usually a difference of a
> couple of hours or more.
>
>
>
> 1. The HDFS sinks write to an Amazon EC2 cloud instance.
>
> 2. The data dir and checkpoint dir of the file channel in all flume
> collector instances are mounted on a separate Hadoop EBS drive. This
> ensures that two separate collectors do not overlap their log and
> checkpoint dirs. There is a symbolic link, i.e. /usr/lib/flume-ng/datasource
> → /hadoop/ebs/mnt-1
>
> 3. Flume works fine for a couple of days, and all the agents and
> collectors are initialized properly without exceptions.
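>
> The per-collector directory layout described in point 2 corresponds to the
> file channel's directory properties; a sketch with hypothetical agent name
> a1 and illustrative paths:
>
> ```properties
> a1.channels.c2.type = file
> # Each collector points at directories on its own EBS mount
> a1.channels.c2.checkpointDir = /hadoop/ebs/mnt-1/checkpoint
> a1.channels.c2.dataDirs = /hadoop/ebs/mnt-1/data
> ```
>
> Each file channel needs exclusive use of its checkpointDir and dataDirs;
> sharing them between channels or agents is exactly the situation that
> produces lock errors.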
>
>
>
> Questions:
>
> Regarding the exception “Failed to obtain lock for writing to the log. Try
> increasing the log write timeout value. [channel=c2]”: according to the
> documentation, such an exception occurs only if two processes are
> accessing the same file/directory. However, each channel is configured
> separately, so no two channels should access the same dir. Hence, this
> exception does not indicate anything. Please correct me if I'm wrong.
>
> Also, hdfs.callTimeout covers calls to HDFS for open and write operations.
> If there is no response within that duration, it times out, and when it
> times out it closes the file. Please correct me if I'm wrong. Also, is
> there a way to specify the number of retries before it closes the file?
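>
> For reference, the timeout in question is the HDFS sink's hdfs.callTimeout
> property; a sketch with a hypothetical sink name k1:
>
> ```properties
> a1.sinks.k1.type = hdfs
> # Milliseconds to allow for HDFS open/write/flush/close calls
> a1.sinks.k1.hdfs.callTimeout = 10000
> ```
>
> Note the value is in milliseconds, not seconds.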
>
>
>
> Your inputs/suggestions will be greatly appreciated.
>
>
>
>
>
> Regards
>
> Kushal Mangtani
>
> Software Engineer
>
>
>
>
>
>
>
>
>
