chukwa-dev mailing list archives

From "Ahmed Fathalla (JIRA)" <j...@apache.org>
Subject [jira] Updated: (CHUKWA-4) Collectors don't finish writing .done datasink from last .chukwa datasink when stopped using bin/stop-collectors
Date Mon, 12 Apr 2010 17:27:51 GMT

     [ https://issues.apache.org/jira/browse/CHUKWA-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Fathalla updated CHUKWA-4:
--------------------------------

    Attachment: CHUKWA-4.patch

This patch contains a fix for corrupt sink files created locally. I've created a new class,
CopySequenceFile, which copies the records from a corrupt .chukwa file into a valid .done file.
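
Roughly, the idea looks like the minimal sketch below (illustrative only, assuming Hadoop's
SequenceFile API; the class and method names here are hypothetical, not the exact code in
the attached patch): read each complete key/value record from the truncated .chukwa file
and re-append it into a fresh file that gets properly closed.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileCopySketch {
  // Copy every complete key/value record from the truncated .chukwa file
  // into a fresh SequenceFile, then close the writer so the .done file
  // carries a proper header and sync markers.
  public static void copy(FileSystem fs, Configuration conf,
                          Path chukwaFile, Path doneFile) throws IOException {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, chukwaFile, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, doneFile, reader.getKeyClass(), reader.getValueClass());
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    try {
      while (reader.next(key, value)) {
        writer.append(key, value);
      }
    } finally {
      reader.close();
      writer.close();
    }
    fs.delete(chukwaFile, false); // drop the corrupt source once the copy succeeds
  }
}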

The code for recovering from a failed copy attempt is included in the cleanup() method of
LocalToRemoteHdfsMover and follows Jerome's suggestions. I have also created a unit test that
creates a sink file, converts it into a .done file, and validates that the .done file was
created and the .chukwa file removed.
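
Conceptually, the cleanup-time recovery amounts to something like the following sketch:
scan the local staging directory for leftover .chukwa files and convert each one before
exiting. Again a hedged sketch with assumed names (it reuses the hypothetical
SequenceFileCopySketch from above), not the patch's actual code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanupRecoverySketch {
  // On shutdown, any file still ending in .chukwa was interrupted mid-write;
  // rewrite it into a .done file so the mover can still ship it to HDFS.
  public static void recoverLeftovers(FileSystem localFs, Configuration conf,
                                      Path stagingDir) throws IOException {
    FileStatus[] entries = localFs.listStatus(stagingDir);
    if (entries == null) {
      return; // staging directory does not exist; nothing to recover
    }
    for (FileStatus status : entries) {
      Path p = status.getPath();
      if (p.getName().endsWith(".chukwa")) {
        // Derive the .done name from the .chukwa name and copy record by record.
        String doneName = p.getName().replace(".chukwa", ".done");
        SequenceFileCopySketch.copy(localFs, conf, p, new Path(p.getParent(), doneName));
      }
    }
  }
}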

I have tested this solution several times and it appears to work. However, I have hit a
rare case in which recovery fails with the following exception while reading from the
.chukwa file and writing to the .done file:


2010-04-12 07:56:47,538 WARN LocalToRemoteHdfsMover CopySequenceFile - Error during .chukwa
file recovery
java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
	at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
	at org.apache.hadoop.chukwa.util.CopySequenceFile.createValidSequenceFile(CopySequenceFile.java:80)
	at org.apache.hadoop.chukwa.datacollection.writer.localfs.LocalToRemoteHdfsMover.cleanup(LocalToRemoteHdfsMover.java:185)
	at org.apache.hadoop.chukwa.datacollection.writer.localfs.LocalToRemoteHdfsMover.run(LocalToRemoteHdfsMover.java:215)


This seemed to happen when recovering from a .chukwa file that was created just before the
collector crashed (the file was roughly 200 KB), so my guess is that the file contains no
complete records and should simply be removed. I would appreciate any pointers on how to
handle this situation.
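
One possible way to handle this failure mode, offered as an assumption rather than a
confirmed fix: catch the EOFException from reader.next(), keep whatever complete records
were copied before the truncation point, and discard the .done file entirely when nothing
usable was recovered. A sketch along those lines, with the same hypothetical names as above:

import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class EofTolerantCopySketch {
  // Returns the number of complete records salvaged into the .done file.
  public static long copy(FileSystem fs, Configuration conf,
                          Path chukwaFile, Path doneFile) throws IOException {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, chukwaFile, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, doneFile, reader.getKeyClass(), reader.getValueClass());
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    long copied = 0;
    try {
      while (reader.next(key, value)) {
        writer.append(key, value);
        copied++;
      }
    } catch (EOFException e) {
      // The trailing record was truncated mid-write; records copied so far are intact.
    } finally {
      reader.close();
      writer.close();
    }
    if (copied == 0) {
      fs.delete(doneFile, false); // nothing salvageable; do not ship an empty .done file
    }
    fs.delete(chukwaFile, false); // the corrupt source is no longer needed either way
    return copied;
  }
}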

> Collectors don't finish writing .done datasink from last .chukwa datasink when stopped using bin/stop-collectors
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: CHUKWA-4
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-4
>             Project: Hadoop Chukwa
>          Issue Type: Bug
>          Components: data collection
>         Environment: I am running on our local cluster, a Linux machine that I also run a Hadoop cluster from.
>            Reporter: Andy Konwinski
>            Priority: Minor
>         Attachments: CHUKWA-4.patch
>
>
> When I use start-collectors, it creates the datasink as expected and writes to it as normal,
> i.e. it writes to the .chukwa file, and rollovers work fine when it renames the .chukwa file
> to .done. However, when I use bin/stop-collectors to shut down the running collector, it
> leaves a .chukwa file in HDFS. I am not sure whether this is a valid sink, but I think the
> collector should gracefully clean up the datasink and rename it to .done before exiting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

