hadoop-mapreduce-user mailing list archives

From Devin Suiter RDX <dsui...@rdx.com>
Subject Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?
Date Thu, 19 Dec 2013 13:55:46 GMT
Hello,

In my experience with Flume, watching the HDFS Sink's verbose output, I know
that even after a file has been flushed, while it is still open it reads as a
0-byte file, even if there is actually data contained in it.

An HDFS "file" is a meta-location that can accept streaming input for as
long as it is open, so its length cannot be defined until both a start and an
end are in place.

The flush operation moves data from a buffer to a storage medium, but I
don't think that necessarily tells the HDFS RecordWriter to place the
"end of stream/EOF" marker, since the "file" meta-location in HDFS is
really a pile of actual files spread around the cluster on physical disk
that HDFS presents to you as one file. The HDFS "file" and the physical
filesplits on disk are distinct. I would suspect that your HDFS flush calls
are forcing Hadoop to move the physical filesplits from their datanode
buffers to disk, but they are not telling HDFS that you expect no further
input - that is what the HDFS close will do.
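To make that distinction concrete, here is a minimal libhdfs sketch (C) of the
flush-then-close sequence. The connection string and the path /tmp/demo.txt
are placeholder assumptions, and it needs a running HDFS cluster plus a link
against libhdfs to actually run:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include "hdfs.h"  /* libhdfs header */

int main(void) {
    /* "default" picks up fs.defaultFS from the client configuration */
    hdfsFS fs = hdfsConnect("default", 0);
    const char *path = "/tmp/demo.txt";  /* hypothetical example path */

    hdfsFile out = hdfsOpenFile(fs, path, O_WRONLY, 0, 0, 0);
    const char *msg = "hello hdfs\n";
    hdfsWrite(fs, out, msg, (tSize)strlen(msg));

    /* Pushes the data to the datanodes: new readers of the file
       CONTENTS can see it after this returns... */
    hdfsHFlush(fs, out);

    /* ...but the metadata may still report a length of 0 */
    hdfsFileInfo *info = hdfsGetPathInfo(fs, path);
    printf("length while open: %lld\n", (long long)info->mSize);
    hdfsFreeFileInfo(info, 1);

    /* Closing the file is what finalizes the length */
    hdfsCloseFile(fs, out);
    info = hdfsGetPathInfo(fs, path);
    printf("length after close: %lld\n", (long long)info->mSize);
    hdfsFreeFileInfo(info, 1);

    hdfsDisconnect(fs);
    return 0;
}
```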

One thing you could try: instead of asking for the length property, which
is probably unavailable until the close call, try reading the contents of
the file.
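For example, through the same webhdfs interface you are already using, you can
compare the metadata view with the data view (the namenode hostname and port
here are placeholder assumptions for your cluster):

```shell
# Metadata view: may still report "length": 0 while the file is open
curl -s "http://namenode:50070/webhdfs/v1/tmp/demo.txt?op=GETFILESTATUS"

# Data view: follows the redirect to a datanode and streams the flushed
# contents, which can be non-empty even while the length reads 0
curl -s -L "http://namenode:50070/webhdfs/v1/tmp/demo.txt?op=OPEN"
```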

Your scenario step 3 says "according to the header hdfs.h, after this call
returns, *new readers should be able to see the data*" which isn't the same
as "new readers can obtain an updated property value from the file
metadata" - one is looking at the data inside the container, and the other
is asking the container to describe itself.

I hope that helps with your problem!


*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xiaobinshe@gmail.com> wrote:

>
> sorry to reply to my own thread.
>
> Does anyone know the answer to this question?
> If so, can you please tell me if my understanding is right or wrong?
>
> thanks.
>
>
>
> 2013/12/17 Xiaobin She <xiaobinshe@gmail.com>
>
>> hi,
>>
>> I'm using libhdfs to deal with hdfs in a C++ program.
>>
>> And I have encountered a problem.
>>
>> here is the scenario:
>> 1. first I call hdfsOpenFile with the O_WRONLY flag to open a file
>> 2. call hdfsWrite to write some data
>> 3. call hdfsHFlush to flush the data; according to the header hdfs.h,
>> after this call returns, new readers should be able to see the data
>> 4. I use an HTTP GET request to get the file list of that directory
>> through the webhdfs interface,
>> here I have to use the webhdfs interface because I need to deal with
>> symlink files
>> 5. from the json response which is returned by webhdfs, I found that
>> the length of the file is still 0.
>>
>> I have tried to replace hdfsHFlush with hdfsFlush or hdfsSync, or to call
>> all three together, but it still doesn't work.
>>
>> But if I call hdfsCloseFile after I call hdfsHFlush, then I can get
>> the correct file length through the webhdfs interface.
>>
>>
>> Is this right? I mean, if you want another process to see the change of
>> the data, do you need to call hdfsCloseFile?
>>
>> Or is there somethings I did wrong?
>>
>> thank you very much for your help.
>>
>>
>>
>>
>>
>
