hadoop-hdfs-user mailing list archives

From Chao Shi <stepi...@live.com>
Subject Read hflushed data without reopen file
Date Fri, 27 Dec 2013 11:34:01 GMT
Hi users,

Suppose a client A opens /f and keeps appending data and then hflushing. Another
client B opens this file for read. I found that B can only see a snapshot
of the data as of the time he opens the file. (After B opens it, A may continue
to write more data; B cannot see it unless he reopens the file.)

Looking into the code, I think this is because DFSInputStream maintains a
file length and simply reports EOF when we read beyond that length. The
file length is updated, so the client has a chance to see a longer file, only when:
1) the file is open, and
2) there are no live DNs to read from (correct? not very sure).
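To make the described behaviour concrete, here is a toy model (plain Java, not actual HDFS code): the reader snapshots the file length at open time and reports EOF at that length, even though the writer keeps appending afterwards.

```java
// Toy model of the snapshot-at-open behaviour described above.
// The "file" is just a shared in-memory buffer; Reader plays the role
// of DFSInputStream, which caps reads at the length known when opened.
public class SnapshotReadDemo {
    static final StringBuilder file = new StringBuilder();

    static final class Reader {
        final int snapshotLength = file.length(); // length captured at open
        int pos = 0;

        // Returns -1 (EOF) once pos reaches the length known at open time,
        // even if the writer has appended more data since then.
        int read() {
            if (pos >= snapshotLength) return -1;
            return file.charAt(pos++);
        }

        String readAll() {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = read()) != -1) sb.append((char) c);
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        file.append("hello");     // writer A writes and hflushes
        Reader b = new Reader();  // reader B opens the file
        file.append("world");     // A appends more after B opened

        System.out.println("first open: " + b.readAll());
        System.out.println("after reopen: " + new Reader().readAll());
    }
}
```

Reader B only sees "hello" until he constructs a fresh Reader (i.e. reopens), which is the reopen-to-see-new-data cycle described above.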

I think such behaviour is inconsistent: clients may see a sudden change of
file length. I guess a better behaviour would be to always try to read beyond the
known file length on the client side and let the DN return EOF if there is no
more data. In this way, client B can continue to see what A wrote and hflushed.
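In the same toy-model style (again not HDFS code), the proposed behaviour would look like this: the reader does not cap reads at a snapshotted length, so EOF only means "no more data right now" and later reads on the same stream can succeed.

```java
// Toy model of the proposed behaviour: no length snapshot at open time.
// EOF is transient, so a reader can tail the file without reopening.
public class TailReadDemo {
    static final StringBuilder file = new StringBuilder();

    static final class Reader {
        int pos = 0;

        // Asks the data source each time instead of consulting a cached
        // length; -1 only means no byte exists *yet* at this position.
        int read() {
            if (pos >= file.length()) return -1;
            return file.charAt(pos++);
        }

        String drain() {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = read()) != -1) sb.append((char) c);
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        file.append("hello");
        Reader b = new Reader();                    // B opens once
        System.out.println("read 1: " + b.drain()); // hits transient EOF
        file.append("world");                       // A hflushes more data
        System.out.println("read 2: " + b.drain()); // same stream, no reopen
    }
}
```

Here the second drain() on the same Reader picks up the newly written bytes, which is exactly what the reopen-per-poll pattern below is working around.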

A real use case for this is HBase log replication. In the region server,
there is a background thread that keeps polling for new HLog entries. It has to
reopen the file every second, which may put pressure on the NN as the number of
region servers grows.

Please correct me if there is anything wrong.

