apex-dev mailing list archives

From Chandni Singh <chan...@datatorrent.com>
Subject Re: AbstractFileOutputOperator to be used with ftp and s3 file System
Date Tue, 03 Nov 2015 17:35:42 GMT
Hi,

Please look at the latest changes to this operator.
These changes make it possible to override how streams are opened and
closed. Implementations can control how they achieve append(), if at all.
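
As a rough illustration of that idea (this is not the operator's actual
code), here is a minimal sketch in plain Java. All names in it
(WriterSketch, openStream, closeStream, AppendingWriter, PartFileWriter) are
hypothetical, and an in-memory map stands in for the file system. The
part-file strategy is only one assumed way a subclass might cope with a
file system that has no append.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical base class: subclasses decide how a stream is (re)opened.
abstract class WriterSketch {
  // In-memory stand-in for a file system: file name -> file contents.
  protected final Map<String, ByteArrayOutputStream> store = new TreeMap<>();

  // Hook: open a stream for the given logical file name.
  protected abstract OutputStream openStream(String name) throws IOException;

  // Hook: close a stream; subclasses may add finalization steps here.
  protected void closeStream(OutputStream out) throws IOException {
    out.close();
  }

  // Opens, writes, and closes on every call, so each call after the first
  // is a "reopen" of the same logical file.
  void write(String name, String text) {
    try {
      OutputStream out = openStream(name);
      try {
        out.write(text.getBytes(StandardCharsets.UTF_8));
      } finally {
        closeStream(out);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  List<String> fileNames() {
    return new ArrayList<>(store.keySet());
  }

  String contents(String name) {
    return new String(store.get(name).toByteArray(), StandardCharsets.UTF_8);
  }
}

// For file systems that support append: reopening continues the same file.
class AppendingWriter extends WriterSketch {
  @Override
  protected OutputStream openStream(String name) {
    return store.computeIfAbsent(name, k -> new ByteArrayOutputStream());
  }
}

// For file systems without append (assumed strategy): every reopen
// starts a new numbered part file instead of appending.
class PartFileWriter extends WriterSketch {
  private final Map<String, Integer> parts = new HashMap<>();

  @Override
  protected OutputStream openStream(String name) {
    int part = parts.merge(name, 1, Integer::sum);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    store.put(name + "." + part, out);
    return out;
  }
}
```

Writing "x" and then "y" to "out.txt" leaves a single file containing "xy"
with AppendingWriter, and two files, "out.txt.1" and "out.txt.2", with
PartFileWriter.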

From its conception, this operator has been based on a cache of open
streams. The cache has a maximum size; when the number of open streams
approaches that limit, the cache evicts entries (closes streams). Another
setting is the expiry time, which evicts and closes a stream when it hasn't
been accessed for a while.
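
To make the two settings concrete, here is a minimal sketch of such a cache
in plain Java. It is not the operator's implementation: the class and method
names (StreamCacheSketch, get, evictExpired) are hypothetical, eviction here
happens exactly at the size limit rather than near it, and the caller passes
in the current time so the behavior is easy to follow.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical cache of open streams with a size limit and an idle expiry.
class StreamCacheSketch {
  private static final class OpenStream {
    final OutputStream stream;
    long lastAccessMillis;

    OpenStream(OutputStream stream, long nowMillis) {
      this.stream = stream;
      this.lastAccessMillis = nowMillis;
    }
  }

  private final int maxSize;
  private final long expiryMillis;
  // Access-ordered map: iteration starts at the least recently used entry.
  private final LinkedHashMap<String, OpenStream> open =
      new LinkedHashMap<>(16, 0.75f, true);
  private int closedCount = 0;

  StreamCacheSketch(int maxSize, long expiryMillis) {
    if (maxSize < 1) {
      throw new IllegalArgumentException("maxSize must be at least 1");
    }
    this.maxSize = maxSize;
    this.expiryMillis = expiryMillis;
  }

  // Returns the cached stream for fileName, opening it if necessary.
  // If the cache is full, the least recently used stream is closed first.
  OutputStream get(String fileName, long nowMillis,
                   Function<String, OutputStream> opener) {
    OpenStream entry = open.get(fileName);
    if (entry == null) {
      if (open.size() >= maxSize) {
        Iterator<Map.Entry<String, OpenStream>> it = open.entrySet().iterator();
        close(it.next().getValue());
        it.remove();
      }
      entry = new OpenStream(opener.apply(fileName), nowMillis);
      open.put(fileName, entry);
    }
    entry.lastAccessMillis = nowMillis;
    return entry.stream;
  }

  // Closes and removes every stream that has been idle for expiryMillis.
  void evictExpired(long nowMillis) {
    Iterator<Map.Entry<String, OpenStream>> it = open.entrySet().iterator();
    while (it.hasNext()) {
      OpenStream entry = it.next().getValue();
      if (nowMillis - entry.lastAccessMillis >= expiryMillis) {
        close(entry);
        it.remove();
      }
    }
  }

  boolean isOpen(String fileName) {
    return open.containsKey(fileName);
  }

  int closedCount() {
    return closedCount;
  }

  private void close(OpenStream entry) {
    try {
      entry.stream.close();
      closedCount++;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch, constructing the cache with Integer.MAX_VALUE and
Long.MAX_VALUE means neither rule ever closes a stream.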

If the user never wants the operator to close a stream, they can initialize
both of these values to their respective maximums. But in a real-world
scenario the user needs to know when a file will eventually be finished
(never written to again). Using that information, they can either configure
these settings or, again, initialize them to their maximums and close the
streams explicitly.

Suppose we don't have this cache and we are writing to multiple files.
That implies multiple streams will hang around in memory all the time, even
if they aren't being accessed. In my opinion this is a problematic design
that will cause bigger issues, such as running out of memory.

Chandni


On Tue, Nov 3, 2015 at 7:58 AM, Thomas Weise <thomas@datatorrent.com> wrote:

> Append is used to continue writing to files that were closed and left in a
> consistent state before. When append is not available, then we would need
> to disable the optimization to close and reopen files?
>
>
> On Tue, Nov 3, 2015 at 6:14 AM, Munagala Ramanath <ram@datatorrent.com>
> wrote:
>
> > Shouldn't "append" be a user-configurable property which, if false,
> > causes the file to be overwritten?
> >
> > Ram
> >
> > On Mon, Nov 2, 2015 at 10:51 PM, Priyanka Gugale
> > <priyanka@datatorrent.com> wrote:
> > > Hi,
> > >
> > > AbstractFileOutputOperator is used to write output files. The operator
> > > has a method "getFSInstance". This initializes file system. One can
> > > override the method to initialize desired file system which extends
> > > hadoop FileSystem. In our implementation we have overridden
> > > "getFSInstance" to initialize FTPFileSystem.
> > >
> > > The file loader code in setup method of AbstractFileOutputOperator
> > > opens the file in append mode when file is already present. The issue
> > > is FTPFileSystem doesn't support append function.
> > >
> > > The solution to problem could be:
> > > 1. Override append method in FTPFileSystem.
> > >     -This would be tricky as file system doesn't support the operation.
> > >     And there are other file systems as well like S3 which also don't
> > >     support append.
> > > 2. Avoid using functions like "append" which are not supported by some
> > > of the implementations of Hadoop FileSystem.
> > > 3. Write file loading logic (which is in setup method) in functions
> > > which can be extended by subclass to override the logic to load files
> > > (by avoiding using calls like append which are not supported by user's
> > > chosen file system).
> > >
> > > -Priyanka
> >
>
