apex-dev mailing list archives

From Chandni Singh <chan...@datatorrent.com>
Subject Re: AbstractFileOutputOperator to be used with ftp and s3 file System
Date Tue, 03 Nov 2015 23:55:40 GMT
Here is an abstract implementation that works with file systems that
don't support append:

https://github.com/chandnisingh/Malhar/blob/examples/library/src/main/java/com/datatorrent/lib/io/fs/AbstractNonAppendFileOutputOperator.java
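
For readers who only want the general idea, below is a minimal sketch of one
way to emulate append on a file system that offers only create, read, and
rename. It is an illustration under stated assumptions, not a summary of the
linked class: it uses java.nio.file on a local path instead of Hadoop
FileSystem, and the names NonAppendWriter and appendByRewrite are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Malhar: emulates append without
// ever opening the target file in append mode.
final class NonAppendWriter {

  private NonAppendWriter() {
  }

  // Copies any existing contents of 'target' into a sibling temp file,
  // writes 'data' after them, then renames the temp file over 'target'.
  static void appendByRewrite(Path target, byte[] data) throws IOException {
    Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
    try (OutputStream out = Files.newOutputStream(tmp)) {
      if (Files.exists(target)) {
        Files.copy(target, out); // carry over what was written before
      }
      out.write(data);
    }
    // Replace the old file with the rewritten one.
    Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
  }
}
```

The cost is that every reopen rewrites the whole file, so this only makes
sense when files are small or are reopened rarely.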

On Tue, Nov 3, 2015 at 9:45 AM, Chandni Singh <chandni@datatorrent.com>
wrote:

> Will do.
>
> On Tue, Nov 3, 2015 at 9:41 AM, Thomas Weise <thomas@datatorrent.com>
> wrote:
>
>> Agreed, there will be applications that write to many files that cannot
>> all remain open forever.
>>
>> Can you provide an example of how to modify the append behavior depending
>> on HFS implementation?
>>
>> https://malhar.atlassian.net/browse/MLHR-1888
>>
>>
>> On Tue, Nov 3, 2015 at 9:35 AM, Chandni Singh <chandni@datatorrent.com>
>> wrote:
>>
>> > Hi,
>> >
>> > Please look at the latest changes to this operator.
>> > These changes enable overriding stream opening and closing.
>> > Implementations can control how they achieve append(), if at all.
>> >
>> > From its conception, this operator has been based on a cache of open
>> > streams with a maximum size. When the cache gets near that limit, it
>> > evicts entries (closes streams). Another setting is the expiry time,
>> > which evicts and closes a stream when it hasn't been accessed for a
>> > while.
>> >
>> > If the user never wants to close a stream, they can initialize both
>> > of these values to their respective max values. But in a real-world
>> > scenario the user needs to know when a file will eventually be closed
>> > (never written to again). Using that information they can configure
>> > these settings, or again initialize them to their max and close the
>> > streams explicitly.
>> >
>> > Let's say we don't have this cache and we are writing to multiple
>> > files. That implies multiple streams will always hang around in
>> > memory, even if they aren't being accessed. In my opinion this is a
>> > problematic design that will cause bigger issues, like frequent
>> > out-of-memory errors.
>> >
>> > Chandni
>> >
>> >
>> > On Tue, Nov 3, 2015 at 7:58 AM, Thomas Weise <thomas@datatorrent.com>
>> > wrote:
>> >
>> > > Append is used to continue writing to files that were closed and left
>> > > in a consistent state before. When append is not available, would we
>> > > need to disable the optimization to close and reopen files?
>> > >
>> > >
>> > > On Tue, Nov 3, 2015 at 6:14 AM, Munagala Ramanath <ram@datatorrent.com>
>> > > wrote:
>> > >
>> > > > Shouldn't "append" be a user-configurable property which, if false,
>> > > > causes the file to be overwritten?
>> > > >
>> > > > Ram
>> > > >
>> > > > On Mon, Nov 2, 2015 at 10:51 PM, Priyanka Gugale
>> > > > <priyanka@datatorrent.com> wrote:
>> > > > > Hi,
>> > > > >
>> > > > > AbstractFileOutputOperator is used to write output files. The
>> > > > > operator has a method "getFSInstance", which initializes the file
>> > > > > system. One can override the method to initialize any desired file
>> > > > > system that extends Hadoop FileSystem. In our implementation we
>> > > > > have overridden "getFSInstance" to initialize FTPFileSystem.
>> > > > >
>> > > > > The file loader code in the setup method of
>> > > > > AbstractFileOutputOperator opens the file in append mode when the
>> > > > > file is already present. The issue is that FTPFileSystem doesn't
>> > > > > support the append function.
>> > > > >
>> > > > > Possible solutions:
>> > > > > 1. Override the append method in FTPFileSystem.
>> > > > >     - This would be tricky, as the file system doesn't support the
>> > > > >       operation. There are also other file systems, like S3, that
>> > > > >       don't support append.
>> > > > > 2. Avoid using functions like "append" that are not supported by
>> > > > >    some implementations of Hadoop FileSystem.
>> > > > > 3. Move the file loading logic (which is in the setup method) into
>> > > > >    functions that a subclass can override, so it can avoid calls
>> > > > >    like append that are not supported by the user's chosen file
>> > > > >    system.
>> > > > >
>> > > > > -Priyanka
>> > > >
>> > >
>> >
>>
>
>
