flume-user mailing list archives

From SaravanaKumar TR <saran0081...@gmail.com>
Subject Re: how spooling directory source identifies the complete file
Date Wed, 23 Jul 2014 07:50:31 GMT
Thanks a lot.

This answer sounds perfect for my question. Let me try mv
instead of cp.
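
The "mv instead of cp" approach can be sketched as below. This is only an illustration under stated assumptions, not anything shipped with Flume: `deliver_to_spool` and the `staging_dir` argument are hypothetical names, and the staging directory is assumed to sit on the same filesystem as the spool directory but outside it. The slow copy happens in staging; the file then appears in the spool directory through a single rename, with a timestamp suffix so the name is never reused.

```python
import os
import shutil
import time


def deliver_to_spool(src_path, spool_dir, staging_dir):
    """Copy src_path into staging_dir, then rename it into spool_dir.

    Hypothetical helper: staging_dir is assumed to be on the same
    filesystem as spool_dir and not watched by Flume.
    """
    base = os.path.basename(src_path)
    # Timestamp suffix keeps the file name unique across deliveries.
    unique_name = f"{base}.{time.time_ns()}"
    staged = os.path.join(staging_dir, unique_name)
    final = os.path.join(spool_dir, unique_name)

    # The incremental write (what cp would do) happens outside the spool dir.
    shutil.copyfile(src_path, staged)

    # os.rename is a single rename on POSIX when both paths are on the same
    # filesystem; across filesystems it raises OSError instead of copying.
    os.rename(staged, final)
    return final
```

`os.rename` is used rather than `shutil.move` on purpose: `shutil.move` falls back to a copy when the two paths are on different filesystems, which would reintroduce the partial-file problem silently.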


On Wed, Jul 23, 2014 at 1:16 PM, Needham, Guy <Guy.Needham@virginmedia.co.uk> wrote:

>  Hi Saravana,
>
> Flume checks the file's size and last-modified time when it starts
> reading the file and again when it has finished reading. If the two sets
> of values differ, Flume will fail noisily. This means that you must move a
> fully written file into the directory or it will not be ingested into your
> workflow. If you're running it on a Unix system, you shouldn't use cp to
> drop the file into the directory, because cp writes the file
> incrementally, whereas mv within the same filesystem is a single rename,
> so the file appears in one go.
>
>
> Regards,
> Guy Needham | Data Discovery
> Virgin Media | Enterprise Data, Design & Management
> Bartley Wood Business Park, Hook, Hampshire RG27 9UP
> D 01256 75 3362
> I welcome VSRE emails. Learn more at http://vsre.info/
>
>
>  ------------------------------
> *From:* SaravanaKumar TR [mailto:saran0081986@gmail.com]
> *Sent:* 23 July 2014 06:38
> *To:* user@flume.apache.org
> *Subject:* Re: how spooling directory source identifies the complete file
>
>  Thanks Ashish, I have already referred to this info.
>
>  But I couldn't find any explanation in the Flume user guide of how Flume
> differentiates between a copy-in-progress file and a fully copied file.
>
>
> On Wed, Jul 23, 2014 at 10:59 AM, Ashish <paliwalashish@gmail.com> wrote:
>
>> This is specified in Flume's User Guide
>>
>>  "Unlike the Exec source, this source is reliable and will not miss
>> data, even if Flume is restarted or killed. In exchange for this
>> reliability, only immutable, uniquely-named files must be dropped into the
>> spooling directory. Flume tries to detect these problem conditions and will
>> fail loudly if they are violated:
>>
>>    1. If a file is written to after being placed into the spooling
>>    directory, Flume will print an error to its log file and stop processing.
>>    2. If a file name is reused at a later time, Flume will print an
>>    error to its log file and stop processing.
>>
>> To avoid the above issues, it may be useful to add a unique identifier
>> (such as a timestamp) to log file names when they are moved into the
>> spooling directory."
>>
>>
>> On Wed, Jul 23, 2014 at 10:17 AM, SaravanaKumar TR <
>> saran0081986@gmail.com> wrote:
>>
>>> Hi Jeff,
>>>
>>>  Thanks for your comments. But what I am really looking for is this:
>>> suppose we are copying a 1 GB file to the spool directory and the copy
>>> is still in progress. How does Flume recognize that the complete file has
>>> been copied into the spool directory and is ready for processing?
>>>
>>>  How does Flume make sure it doesn't start processing a partially copied
>>> file?
>>>
>>>
>>> On Tue, Jul 22, 2014 at 11:15 PM, Jeff Lord <jlord@cloudera.com> wrote:
>>>
>>>> I believe the way this works is that flume creates a meta directory to
>>>> track which file is being read.
>>>> In the event of a restart of the agent the entire file will be re-read
>>>> which will create some duplicate events.
>>>>
>>>>
>>>> https://github.com/apache/flume/blob/flume-1.5/flume-ng-core/src/main/java/org/apache/flume/client/avro/ReliableSpoolingFileEventReader.java#L474
>>>>
>>>>
>>>> On Tue, Jul 22, 2014 at 6:15 AM, SaravanaKumar TR <
>>>> saran0081986@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>  I am planning to use the spooling directory source to move log files
>>>>> into an HDFS sink.
>>>>>
>>>>>  I would like to know how Flume identifies whether a file we are moving
>>>>> to the spool directory is complete, or partial with its move still in
>>>>> progress.
>>>>>
>>>>>  Suppose a file is large and we have started moving it to the spool
>>>>> directory: how does Flume identify whether the complete file has been
>>>>> transferred or the transfer is still in progress?
>>>>>
>>>>>  Please help me out here.
>>>>>
>>>>>  Thanks,
>>>>> saravana
>>>>>
>>>>
>>>>
>>>
>>
>>
>>   --
>> thanks
>> ashish
>>
>> Blog: http://www.ashishpaliwal.com/blog
>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>>
>
>
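
The check Guy describes in the thread (compare size and last-modified time at the start and end of reading) can be illustrated with a minimal sketch. This is not Flume's code, which is Java; `snapshot` and `read_if_unchanged` are hypothetical names used only to show the idea of failing loudly when a file changes underneath the reader.

```python
import os


def snapshot(path):
    """Return (size in bytes, mtime in nanoseconds) for path."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def read_if_unchanged(path):
    """Read path fully; raise if size or mtime changed during the read.

    Illustrative only: mirrors the before/after comparison described in
    the thread, not Flume's actual implementation.
    """
    before = snapshot(path)
    with open(path, "rb") as f:
        data = f.read()
    after = snapshot(path)
    if before != after:
        raise RuntimeError(f"{path} was modified while being read")
    return data
```

A check like this only detects a change that happens during the read, so it is a safety net and not a way to tell whether a copy has finished; that is why the file has to be complete before it lands in the spool directory.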
