hadoop-common-user mailing list archives

From: Todd Lipcon <t...@cloudera.com>
Subject: Re: FSDataOutputStream flush() not working?
Date: Fri, 15 May 2009 17:14:33 GMT
Hi Sasha,

What version are you running? Up until very recent versions, sync() was not
implemented. Even in the newest releases, sync isn't completely finished,
and you may find unreliable behavior.

For now, if you need this kind of behavior, your best bet is to close each
file and then open the next every N minutes. For example, if you're
processing logs every 5 minutes, simply close log file log.00223 and round
robin to log.00224 right before you need the data to be available to
readers. If you're collecting data at a low rate, these files may end up
being rather small, and you should probably look into doing merges on the
hour/day/etc to avoid small-file proliferation.
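For illustration, a minimal sketch of that close-and-roll pattern against the
FileSystem API might look like the code below. The class name
RollingHdfsWriter, the 5-minute ROLL_INTERVAL_MS, and the log.%05d naming are
assumptions made up for this example, not something from the thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingHdfsWriter {
    // Roll to a new file every 5 minutes (illustrative value).
    private static final long ROLL_INTERVAL_MS = 5 * 60 * 1000;

    private final FileSystem fs;
    private final Path dir;
    private FSDataOutputStream out;
    private long lastRoll = 0;
    private int sequence = 0;

    public RollingHdfsWriter(Configuration conf, Path dir) throws IOException {
        this.fs = FileSystem.get(conf);
        this.dir = dir;
    }

    /** Write one record, rolling to a new file when the interval has elapsed. */
    public synchronized void write(String record) throws IOException {
        long now = System.currentTimeMillis();
        if (out == null || now - lastRoll >= ROLL_INTERVAL_MS) {
            if (out != null) {
                // Closing the current file is what makes its data visible to readers.
                out.close();
            }
            // Round-robin naming, e.g. log.00223 -> log.00224.
            Path next = new Path(dir, String.format("log.%05d", sequence++));
            out = fs.create(next);
            lastRoll = now;
        }
        out.writeBytes(record);
    }

    public synchronized void close() throws IOException {
        if (out != null) {
            out.close();
        }
    }
}

Merging the resulting small files hourly or daily, as suggested above, would
then be a separate periodic job.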

If you want to track the work being done around append and sync, check out
HADOOP-5744 and the issues referenced therein:

http://issues.apache.org/jira/browse/HADOOP-5744

Hope that helps,
-Todd

On Fri, May 15, 2009 at 6:35 AM, Sasha Dolgy <sdolgy@gmail.com> wrote:

> Hi there, forgive the repost:
>
> Right now data is received in parallel and written to a queue; a single
> thread then reads the queue and writes those messages to an
> FSDataOutputStream which is kept open, but the messages never get flushed.
> I've tried flush() and sync() with no joy.
>
> 1.
> outputStream.writeBytes(rawMessage.toString());
>
> 2.
>
> log.debug("Flushing stream, size = " + s.getOutputStream().size());
> s.getOutputStream().sync();
> log.debug("Flushed stream, size = " + s.getOutputStream().size());
>
> or
>
> log.debug("Flushing stream, size = " + s.getOutputStream().size());
> s.getOutputStream().flush();
> log.debug("Flushed stream, size = " + s.getOutputStream().size());
>
> The size() remains the same after performing this action.
>
> 2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:28)
> hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
> 2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:49)
> hdfs.HdfsQueueConsumer: Re-using existing stream
> 2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:63)
> hdfs.HdfsQueueConsumer: Flushing stream, size = 1986
> 2009-05-12 12:42:17,472 DEBUG [Thread-7] (DFSClient.java:3013)
> hdfs.DFSClient: DFSClient flush() : saveOffset 1613 bytesCurBlock 1986
> lastFlushOffset 1731
> 2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:66)
> hdfs.HdfsQueueConsumer: Flushed stream, size = 1986
> 2009-05-12 12:42:19,586 DEBUG [Thread-7] (HdfsQueueConsumer.java:39)
> hdfs.HdfsQueueConsumer: Consumer writing event
> 2009-05-12 12:42:19,587 DEBUG [Thread-7] (FSStreamManager.java:28)
> hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
> 2009-05-12 12:42:19,588 DEBUG [Thread-7] (FSStreamManager.java:49)
> hdfs.HdfsQueueConsumer: Re-using existing stream
> 2009-05-12 12:42:19,589 DEBUG [Thread-7] (FSStreamManager.java:63)
> hdfs.HdfsQueueConsumer: Flushing stream, size = 2235
> 2009-05-12 12:42:19,589 DEBUG [Thread-7] (DFSClient.java:3013)
> hdfs.DFSClient: DFSClient flush() : saveOffset 2125 bytesCurBlock 2235
> lastFlushOffset 1986
> 2009-05-12 12:42:19,590 DEBUG [Thread-7] (FSStreamManager.java:66)
> hdfs.HdfsQueueConsumer: Flushed stream, size = 2235
>
> So although the offset is changing as expected, the output stream isn't
> being flushed or cleared out, and nothing gets written to the file unless
> the stream is closed with close() ... is this the expected behaviour?
>
> -sd
>
