apex-dev mailing list archives

From Thomas Weise <tho...@datatorrent.com>
Subject Re: AbstractFileOutputOperator to be used with ftp and s3 file System
Date Tue, 03 Nov 2015 17:41:51 GMT
Agreed, there will be applications that write to many files, which cannot
all remain open forever.

Can you provide an example of how to modify the append behavior depending
on the Hadoop FileSystem (HFS) implementation?

https://malhar.atlassian.net/browse/MLHR-1888
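
To make the question concrete, here is a rough sketch of an overridable
stream-opening hook. It is not Malhar or Hadoop code: MiniFs, InMemoryFs,
FileWriterBase, RollingPartWriter and the ".1", ".2" part-file naming are all
hypothetical names invented for illustration. The default writer appends to an
existing file, while the subclass targets a file system without append and
rolls over to a new part file instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a file system; NOT the Hadoop FileSystem API. */
interface MiniFs {
    boolean exists(String path);
    OutputStream create(String path) throws IOException;  // creates or overwrites
    OutputStream append(String path) throws IOException;  // may be unsupported
}

/** In-memory file system; append can be switched off to mimic FTP or S3. */
class InMemoryFs implements MiniFs {
    private final Map<String, ByteArrayOutputStream> files = new HashMap<>();
    private final boolean appendSupported;

    InMemoryFs(boolean appendSupported) {
        this.appendSupported = appendSupported;
    }

    public boolean exists(String path) {
        return files.containsKey(path);
    }

    public OutputStream create(String path) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        files.put(path, out);
        return out;
    }

    public OutputStream append(String path) throws IOException {
        if (!appendSupported) {
            throw new UnsupportedOperationException("append not supported");
        }
        ByteArrayOutputStream out = files.get(path);
        if (out == null) {
            throw new FileNotFoundException(path);
        }
        return out;  // closing a ByteArrayOutputStream keeps its contents
    }

    String contents(String path) {
        return new String(files.get(path).toByteArray(), StandardCharsets.UTF_8);
    }
}

/** Default behavior: reopen an existing file in append mode. */
class FileWriterBase {
    protected final MiniFs fs;

    FileWriterBase(MiniFs fs) {
        this.fs = fs;
    }

    /** Hook that subclasses override to change how a stream is opened. */
    protected OutputStream openStream(String path) throws IOException {
        return fs.exists(path) ? fs.append(path) : fs.create(path);
    }

    void write(String path, String data) throws IOException {
        try (OutputStream out = openStream(path)) {
            out.write(data.getBytes(StandardCharsets.UTF_8));
        }
    }
}

/** For file systems without append: never call append, start a new part file. */
class RollingPartWriter extends FileWriterBase {
    RollingPartWriter(MiniFs fs) {
        super(fs);
    }

    @Override
    protected OutputStream openStream(String path) throws IOException {
        if (!fs.exists(path)) {
            return fs.create(path);
        }
        int part = 1;
        while (fs.exists(path + "." + part)) {
            part++;  // find the first unused part suffix
        }
        return fs.create(path + "." + part);
    }
}
```

With new InMemoryFs(false), two writes to "out.txt" through RollingPartWriter
would leave the first chunk in "out.txt" and the second in "out.txt.1", whereas
FileWriterBase on an append-capable file system keeps both in one file.
Overwriting instead of rolling over would be another possible override.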


On Tue, Nov 3, 2015 at 9:35 AM, Chandni Singh <chandni@datatorrent.com>
wrote:

> Hi,
>
> Please look at the latest changes to this operator.
> These changes enable overriding stream opening and closing. Implementations
> can control how they achieve append(), if at all.
>
> From its conception, this operator has been based on a cache of open streams
> with a maximum size: whenever that limit is approached, the cache evicts
> entries (closes streams). A second setting is the expiry time, which evicts
> and closes a stream that hasn't been accessed in the cache for a while.
>
> If the user never wants to close a stream, they can initialize both of
> these values to their respective max values. But in a real-world scenario
> the user needs to know when a file will eventually be closed (never written
> to again). Using that information, they can configure these settings, or
> again initialize them to their max and close the streams explicitly.
>
> Suppose we don't have this cache and we are writing to multiple files.
> That implies that multiple streams will hang around in memory all the time,
> even if they aren't being accessed. In my opinion this is a problematic
> design that will cause bigger issues, such as frequent out-of-memory errors.
>
> Chandni
>
>
> On Tue, Nov 3, 2015 at 7:58 AM, Thomas Weise <thomas@datatorrent.com>
> wrote:
>
> > Append is used to continue writing to files that were closed and left in a
> > consistent state before. When append is not available, would we need to
> > disable the optimization to close and reopen files?
> >
> >
> > On Tue, Nov 3, 2015 at 6:14 AM, Munagala Ramanath <ram@datatorrent.com>
> > wrote:
> >
> > > Shouldn't "append" be a user-configurable property which, if false,
> > > causes the file to be overwritten?
> > >
> > > Ram
> > >
> > > On Mon, Nov 2, 2015 at 10:51 PM, Priyanka Gugale
> > > <priyanka@datatorrent.com> wrote:
> > > > Hi,
> > > >
> > > > AbstractFileOutputOperator is used to write output files. The operator
> > > > has a method "getFSInstance", which initializes the file system. One can
> > > > override the method to initialize any desired file system that extends
> > > > Hadoop FileSystem. In our implementation we have overridden
> > > > "getFSInstance" to initialize FTPFileSystem.
> > > >
> > > > The file loader code in the setup method of AbstractFileOutputOperator
> > > > opens the file in append mode when the file is already present. The
> > > > issue is that FTPFileSystem doesn't support the append function.
> > > >
> > > > Possible solutions:
> > > > 1. Override the append method in FTPFileSystem.
> > > >     - This would be tricky, as the file system doesn't support the
> > > >       operation. There are also other file systems, such as S3, that
> > > >       don't support append.
> > > > 2. Avoid using functions like "append" that are not supported by some
> > > >    implementations of Hadoop FileSystem.
> > > > 3. Move the file loading logic (currently in the setup method) into
> > > >    functions that a subclass can override, so that it can avoid calls
> > > >    like append that are not supported by the user's chosen file system.
> > > >
> > > > -Priyanka
> > >
> >
>
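
The stream cache described in the quoted reply (a maximum number of open
streams plus an expiry time, with eviction closing the stream) can be sketched
roughly as follows. This is an illustration only, not the operator's actual
implementation: StreamCache, its constructor parameters and the explicit
nowMillis argument are hypothetical, and callers are assumed to pass
non-decreasing timestamps.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustrative LRU cache of open streams; evicting an entry closes its stream. */
class StreamCache<S extends Closeable> {
    private static final class Entry<T> {
        final T stream;
        long lastAccessMillis;

        Entry(T stream, long now) {
            this.stream = stream;
            this.lastAccessMillis = now;
        }
    }

    private final int maxOpenStreams;
    private final long expiryMillis;
    private final Function<String, S> opener;
    // accessOrder = true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, Entry<S>> open =
            new LinkedHashMap<>(16, 0.75f, true);

    StreamCache(int maxOpenStreams, long expiryMillis, Function<String, S> opener) {
        this.maxOpenStreams = maxOpenStreams;
        this.expiryMillis = expiryMillis;
        this.opener = opener;
    }

    /** Returns the open stream for a file, (re)opening it and evicting others if needed. */
    S get(String file, long nowMillis) throws IOException {
        evictExpired(nowMillis);
        Entry<S> entry = open.get(file);  // also marks the entry as most recently used
        if (entry == null) {
            if (open.size() >= maxOpenStreams) {
                evictEldest();
            }
            entry = new Entry<>(opener.apply(file), nowMillis);
            open.put(file, entry);
        }
        entry.lastAccessMillis = nowMillis;
        return entry.stream;
    }

    /** Size-based eviction: close the least recently used stream. */
    private void evictEldest() throws IOException {
        Iterator<Map.Entry<String, Entry<S>>> it = open.entrySet().iterator();
        if (it.hasNext()) {
            Entry<S> eldest = it.next().getValue();
            it.remove();
            eldest.stream.close();
        }
    }

    /** Time-based eviction: close every stream idle for at least expiryMillis. */
    private void evictExpired(long nowMillis) throws IOException {
        Iterator<Map.Entry<String, Entry<S>>> it = open.entrySet().iterator();
        while (it.hasNext()) {
            Entry<S> entry = it.next().getValue();
            if (nowMillis - entry.lastAccessMillis < expiryMillis) {
                break;  // entries after this one were accessed more recently
            }
            it.remove();
            entry.stream.close();
        }
    }

    int openCount() {
        return open.size();
    }
}
```

Setting maxOpenStreams to Integer.MAX_VALUE and expiryMillis to Long.MAX_VALUE
corresponds to the "never close" configuration mentioned above. A file whose
stream was evicted gets a fresh stream on the next get(), which is where the
append (or a replacement for it) comes into play.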
