Date: Fri, 3 Aug 2012 17:45:39 +0900
From: Christopher Birchall <birchall@infoscience.co.jp>
To: user@flume.apache.org
Subject: Re: Writing reliably to HDFS

Juhani,

Thanks for the advice.

Just to clarify, when I talk about the agent "dying", I mean crashing or
being killed unexpectedly. I'm worried about how HDFS writes are handled
in these cases. When the agent is shut down cleanly, I can confirm that
all HDFS files are closed correctly and no .tmp files are left lying
around.

In the case where the agent dies suddenly and zero-byte .tmp files are
left over, I still haven't found a way to get Hadoop to fix those files
for me.

Chris.
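P.S. If there is a programmatic way to fix them up, I'm guessing it would
look something like the untested sketch below, assuming the HDFS 2.0.0-alpha
client exposes DistributedFileSystem#recoverLease (the class name, argument
handling and polling interval are just illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Untested sketch: ask the NameNode to recover the lease on an orphaned
// .tmp file so that it gets closed and its real length becomes visible.
// Assumes the default filesystem is HDFS; class name and the 4-second
// polling interval are made up for illustration.
public class RecoverTmpFile {
    public static void main(String[] args) throws Exception {
        Path tmpFile = new Path(args[0]);  // path to the leftover .tmp file
        FileSystem fs = FileSystem.get(new Configuration());
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        // recoverLease() returns true once the NameNode has closed the
        // file; keep retrying until it does.
        while (!dfs.recoverLease(tmpFile)) {
            Thread.sleep(4000);
        }
        System.out.println("Recovered: " + tmpFile);
    }
}

The idea is just to keep asking the NameNode to recover the lease until the
file is closed, at which point its real length should become visible.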
On 2012/08/02 12:45, Juhani Connolly wrote:
> Hi Chris,
>
> Answers inline
>
> On 08/02/2012 11:07 AM, Christopher Birchall wrote:
>> Hi,
>>
>> I'm trying to write events to HDFS using Flume 1.2.0 and I have a
>> couple of questions.
>>
>> Firstly, about the reliability semantics of the HdfsEventSink.
>>
>> My number one requirement is reliability, i.e. not losing any events.
>> Ideally, by the time the HdfsEventSink commits the transaction, all
>> events should be safely written to HDFS and visible to other clients,
>> so that no data is lost even if the agent dies after that point. But
>> what is actually happening in my tests is as follows:
>>
>> 1. The HDFS sink takes some events from the FileChannel and writes
>> them to a SequenceFile on HDFS.
>> 2. The sink commits the transaction, and the FileChannel updates its
>> checkpoint. As far as the FileChannel is concerned, the events have
>> been safely written to the sink.
>> 3. Kill the agent.
>>
>> Result: I'm left with a weird .tmp file on HDFS that is at once
>> zero-byte and not zero-byte. The SequenceFile has not yet been closed
>> and rolled over, so it is still a ".tmp" file. The data is actually in
>> the HDFS blocks, but because the file was never closed, the NameNode
>> thinks it has a length of 0 bytes. I'm not sure how to recover from
>> this.
>>
>> Is this the expected behaviour of the HDFS sink, or am I doing
>> something wrong? Do I need to explicitly enable HDFS append? (I am
>> using HDFS 2.0.0-alpha.)
>>
>> I guess the problem is that data is not "safely" written until file
>> rollover occurs, but the timing of file rollover (by time, event
>> count, file size, etc.) is unrelated to the timing of transactions. Is
>> there any way to keep these in sync with each other?
> Regarding reliability, I believe that while the file may not be closed,
> you're not actually at risk of losing data. I suspect that adding some
> code to the sink shutdown to close up any temp files would be a good
> idea. To deal with unexpected failures, it might even be worth scanning
> the destination path for any unclosed files on startup.
>
> I'm not really too familiar with the inner workings of the HDFS sink,
> so maybe someone else can add more detail. In our test setup we have
> yet to see any data loss from it.
>> Second question: could somebody please explain the reasoning behind
>> the default values of the HDFS sink configuration? If I use the
>> defaults, the sink generates zillions of tiny files (at most 10 events
>> per file), which as I understand it is not a recommended way to use
>> HDFS.
>>
>> Is it OK to change these settings to generate much larger files (MB or
>> GB scale)? Or should I write a script that periodically combines these
>> tiny files into larger ones?
>>
>> Thanks for any advice,
>>
>> Chris Birchall.
>>
> There's no harm in changing those defaults and I'd strongly recommend
> doing so. We have most of the rolls switched off (set to 0) and we just
> roll hourly (because that's how we want to separate our logs). You may
> also want to change hdfs.batchSize, which defaults to 1; that is going
> to cause a bottleneck if you have even a moderate amount of traffic.
> One thing to note is that with large batches it's possible for events
> to be duplicated: if a batch gets partially written and then hits an
> error, it will be rolled back at the channel and then rewritten.
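For reference, this is roughly the sink configuration I'm planning to try
based on the above. The agent/channel/sink names are placeholders and the
numbers are untested guesses, not values from anyone's production setup:

# Placeholder names (agent1, file-channel, hdfs-sink) and example values
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = file-channel
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.hdfs-sink.hdfs.fileType = SequenceFile

# Switch off count- and size-based rolling, and roll once an hour instead,
# so we end up with a small number of large files.
agent1.sinks.hdfs-sink.hdfs.rollCount = 0
agent1.sinks.hdfs-sink.hdfs.rollSize = 0
agent1.sinks.hdfs-sink.hdfs.rollInterval = 3600

# Write many events per transaction instead of the default of 1.
agent1.sinks.hdfs-sink.hdfs.batchSize = 1000

In other words: count- and size-based rolls disabled, an hourly time-based
roll, and a much larger batch size than the default.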