apex-dev mailing list archives

From Sandeep Deshmukh <sand...@datatorrent.com>
Subject Re: AbstractFileOutputOperator maxLength roll over handling
Date Fri, 11 Dec 2015 15:21:05 GMT
A file just larger than the block size will occupy two HDFS blocks, and
hence incur a slight performance hit. The second block is likely to be very
small, a few bytes, which is not advisable on HDFS.

I would vote for flipping the check, while taking Ram's point into account.
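
To make the two-block arithmetic concrete, here is a minimal sketch. The class and method names (BlockCount, blocksFor) are hypothetical and not part of Apex or HDFS; it only applies ceiling division to the 64 MB example block size used in this thread.

```java
class BlockCount {
    // Number of blocks needed to hold fileLength bytes (ceiling division).
    // Hypothetical helper for illustration only.
    static long blocksFor(long fileLength, long blockSize) {
        return (fileLength + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB, the example value from this thread
        // A file of exactly one block size fits in a single block.
        System.out.println(blocksFor(blockSize, blockSize));
        // A file only 100 bytes larger spills into a second, nearly empty block.
        System.out.println(blocksFor(blockSize + 100, blockSize));
    }
}
```
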
On 11 Dec 2015 20:43, "Munagala Ramanath" <ram@datatorrent.com> wrote:

> Guess we don't need to worry about the case when the tuple size itself is
> larger than the HDFS block size :-)
>
> Ram
>
> On Fri, Dec 11, 2015 at 12:37 AM, Yogi Devendra <yogidevendra@apache.org>
> wrote:
>
> > Hi,
> >
> > I am using AbstractFileOutputOperator in my application for writing
> > incoming tuples into a file on HDFS.
> >
> > Since there could be failover scenarios, I am using
> > fileOutputOperator.setMaxLength() to roll over the files after a
> > specified length, on the assumption that rolled-over files recover
> > faster from a failure (recovery is needed only for the last part of the
> > file, not for the entire file).
> >
> > The use case does not dictate a specific value for maxLength. Hence, I
> > would prefer the rolled-over file sizes to be equal to the HDFS block
> > size (say 64 MB).
> >
> > With the current implementation of AbstractFileOutputOperator, the actual
> > size of a rolled-over file is slightly greater than 64 MB. This is
> > because the file-size check (for roll over) happens after the incoming
> > tuple is written to the file.
> >
> > I believe that files slightly greater than 64 MB would result in 2 entries
> > on the NameNode. This can be avoided if we flip the sequence: check
> > the file size (including the incoming tuple) and roll over to a new file
> > *before* writing the incoming tuple.
> >
> > Do you think this improvement should be considered? If yes, I will
> > create a JIRA and work on it.
> >
> > Also, does this code change break backward compatibility? Although the
> > signature of the API remains the same, there is a slight change in the
> > semantics. Thus, I wanted to get feedback from the community.
> >
> > ~ Yogi
> >
>
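
The two orderings discussed in this thread can be compared with a small simulation. This is only a sketch under stated assumptions: RollOverSketch and its methods are hypothetical names, it works on tuple sizes rather than real files, and the exact comparison used by AbstractFileOutputOperator is not reproduced here. The guard on an empty file is one possible way to handle Ram's case of a tuple larger than the limit.

```java
import java.util.ArrayList;
import java.util.List;

class RollOverSketch {
    // Ordering described in the thread: write the tuple, then check the size.
    // Returns the sizes of the files produced (hypothetical simulation).
    static List<Long> checkAfterWrite(long[] tupleSizes, long maxLength) {
        List<Long> files = new ArrayList<>();
        long current = 0;
        for (long size : tupleSizes) {
            current += size;             // write first
            if (current > maxLength) {   // then check: file may already exceed the limit
                files.add(current);
                current = 0;
            }
        }
        if (current > 0) {
            files.add(current);          // last, partially filled file
        }
        return files;
    }

    // Proposed ordering: roll over before writing if the tuple would not fit.
    static List<Long> checkBeforeWrite(long[] tupleSizes, long maxLength) {
        List<Long> files = new ArrayList<>();
        long current = 0;
        for (long size : tupleSizes) {
            // The current > 0 guard means a tuple larger than maxLength is
            // written to an empty file instead of triggering empty roll-overs.
            if (current > 0 && current + size > maxLength) {
                files.add(current);
                current = 0;
            }
            current += size;
        }
        if (current > 0) {
            files.add(current);
        }
        return files;
    }

    public static void main(String[] args) {
        long[] tuples = {40, 40, 40};
        System.out.println(checkAfterWrite(tuples, 64));
        System.out.println(checkBeforeWrite(tuples, 64));
    }
}
```

With the check before the write, no file exceeds maxLength unless a single tuple is itself larger than the limit. The cost is that files may end somewhat short of the limit, and existing users would see different file boundaries, which is the semantic change raised above.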
