hadoop-hdfs-user mailing list archives

From Xiaobin She <xiaobin...@gmail.com>
Subject Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?
Date Thu, 19 Dec 2013 16:36:15 GMT
To Devin,

thank you very much for your explanation.

I did find that I can read the data out of the file even if I have not closed
the file I'm writing to (the read operation is called on another file
handle opened on the same file, but still in the same process), which made
me more confused at the time: since I can read the data
from the file, why can't I get the length of the file correctly?

But from the explanation that you have given, I think I can understand
it now.

So it seems that in order to do what I want (write some data to the file, and
then get the length of the file through the webhdfs interface), I have to open
and close the file every time I do a write operation.
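
For reference, the length I am checking is the "length" field of each
FileStatus entry in the webhdfs LISTSTATUS response. Below is a minimal sketch
of reading it in C++. The helper name lengthFromListStatus is hypothetical
(it is not part of libhdfs or webhdfs), it assumes compact JSON with no
whitespace around the colons and flat FileStatus objects, and it uses a naive
string search where a real JSON parser would be the safer choice.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper, for illustration only: returns the "length" value of
// the FileStatus entry whose "pathSuffix" equals `name`, or -1 if no such
// entry (or no length field) is found. Assumes compact JSON and flat
// FileStatus objects; a real JSON parser is preferable in production code.
long long lengthFromListStatus(const std::string& json, const std::string& name) {
    const std::string key = "\"pathSuffix\":\"" + name + "\"";
    const std::size_t hit = json.find(key);
    if (hit == std::string::npos) return -1;

    // Locate the braces of the FileStatus object that contains the match.
    const std::size_t open = json.rfind('{', hit);
    const std::size_t close = json.find('}', hit);
    if (open == std::string::npos || close == std::string::npos) return -1;

    // Look for the length field inside that object only.
    const std::string field = "\"length\":";
    const std::size_t pos = json.find(field, open);
    if (pos == std::string::npos || pos > close) return -1;

    return std::strtoll(json.c_str() + pos + field.size(), nullptr, 10);
}
```

With the behaviour described in this thread, calling this on the response
fetched after hdfsHFlush would return 0 for the file still being written, and
the real size only once hdfsCloseFile has been called.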

Thank you very much again.


2013/12/19 Devin Suiter RDX <dsuiter@rdx.com>

> Hello,
> In my experience with Flume, watching the HDFS Sink verbose output, I know
> that even after a file has flushed, but is still open, it reads as a 0-byte
> file, even if there is actually data contained in the file.
> An HDFS "file" is a meta-location that can accept streaming input for as
> long as it is open, so the length cannot be mathematically defined until a
> start and an end are in place.
> The flush operation moves data from a buffer to a storage medium, but I
> don't think that necessarily means that it tells the HDFS RecordWriter to
> place the "end of stream/EOF" marker down, since the "file" meta-location
> in HDFS is a pile of actual files around the cluster on physical disk that
> HDFS presents to you as one file. The HDFS "file" and the physical file
> splits on disk are distinct, and I would suspect that your HDFS flush calls
> are forcing Hadoop to move the physical filesplits from their physical
> datanode buffers to disk, but are not telling HDFS that you expect no
> further input - that is what the HDFS close will do.
> One thing you could try - instead of asking for the length property, which
> is probably unavailable until the close call, try asking for/viewing the
> contents of the file.
> Your scenario step 3 says "according to the header hdfs.h, after this
> call returns, *new readers should be able to see the data*" which isn't
> the same as "new readers can obtain an updated property value from the file
> metadata" - one is looking at the data inside the container, and the other
> is asking the container to describe itself.
> I hope that helps with your problem!
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
> On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xiaobinshe@gmail.com> wrote:
>> sorry to reply to my own thread.
>> Does anyone know the answer to this question?
>> If so, can you please tell me if my understanding is right or wrong?
>> thanks.
>> 2013/12/17 Xiaobin She <xiaobinshe@gmail.com>
>>> hi,
>>> I'm using libhdfs to deal with hdfs in a C++ program.
>>> And I have encountered a problem.
>>> here is the scenario :
>>> 1. first I call hdfsOpenFile with the O_WRONLY flag to open a file
>>> 2. call hdfsWrite to write some data
>>> 3. call hdfsHFlush to flush the data; according to the header hdfs.h,
>>> after this call returns, new readers should be able to see the data
>>> 4. I use an HTTP GET request to get the file list of that directory
>>> through the webhdfs interface;
>>> here I have to use the webhdfs interface because I need to deal with
>>> symlink files
>>> 5. from the JSON response returned by webhdfs, I found that
>>> the length of the file is still 0.
>>> I have tried replacing hdfsHFlush with hdfsFlush or hdfsSync, or calling
>>> all three together, but it still doesn't work.
>>> But if I call hdfsCloseFile after I call hdfsHFlush, then I can get
>>> the correct file length through the webhdfs interface.
>>> Is this right? I mean, if you want other processes to see the change
>>> to the data, do you need to call hdfsCloseFile?
>>> Or is there something I did wrong?
>>> thank you very much for your help.
