hadoop-general mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Can we modify files in HDFS?
Date Tue, 29 Jun 2010 09:57:01 GMT
elton sky wrote:
> thanx Jeff,
> So...it is a significant drawback.
> As a matter of fact, there are many cases we need to modify.

When people say "Hadoop filesystems are not POSIX", this is what they 
mean: no locks, no read/write, seeking discouraged. Even append is only 
just stabilising. To be fair, though, even NFS is quirky, and that's 
been around since Ether-net was considered so cutting edge it had a 
hyphen in the name.

HDFS delivers availability through redundant copies across multiple 
machines: you can read your data on or near any machine with a copy of 
it. Think what you'd need to support full seek and read/write semantics:

* seek would kill bulk IO perf on classic rotating-disk HDDs, and nobody 
can afford to build a petabyte filestore out of SSDs yet. You should be 
streaming, not seeking.
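Some back-of-the-envelope arithmetic makes the point. The numbers below are illustrative 2010-era assumptions (roughly 100 MB/s sequential transfer, 10 ms per seek on a rotating disk), not measurements:

```python
# Illustrative arithmetic: why random seeks destroy bulk-read throughput
# on a rotating disk. Assumed, not measured, figures.
SEQ_MB_PER_S = 100        # assumed sequential transfer rate
SEEK_S = 0.010            # assumed average seek + rotational latency
TOTAL_MB = 1024           # read 1 GB

# Pure streaming: one long sequential read.
stream_s = TOTAL_MB / SEQ_MB_PER_S        # 10.24 s

# Seek-heavy: one seek before every 64 KB block read.
block_mb = 64 / 1024
seeks = TOTAL_MB / block_mb               # 16384 seeks
seeky_s = stream_s + seeks * SEEK_S       # 174.08 s

print(f"streaming: {stream_s:.2f}s, seek-per-64KB: {seeky_s:.2f}s")
print(f"slowdown: {seeky_s / stream_s:.1f}x")   # 17.0x
```

Same bytes transferred, an order of magnitude more wall-clock time, which is why the design pushes you towards streaming.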

* to do writes, you'd need to lock out access to the files, which 
implies a distributed lock infrastructure (ZooKeeper?), or else deal 
with conflicting writes.
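For a feel of what that lock infrastructure involves, here is a minimal in-memory sketch of the ZooKeeper "ephemeral sequential" lock recipe: each would-be writer creates a sequentially numbered node, and whoever holds the lowest number holds the lock. The `ZNodeStore` class and node names are invented for illustration; a real deployment would talk to ZooKeeper itself:

```python
import itertools

class ZNodeStore:
    """Toy stand-in for a ZooKeeper lock directory (illustrative only)."""
    def __init__(self):
        self._seq = itertools.count()
        self.nodes = {}                       # node name -> owner

    def create_sequential(self, owner):
        # ZooKeeper appends a monotonically increasing suffix; we fake it.
        name = f"lock-{next(self._seq):010d}"
        self.nodes[name] = owner
        return name

    def holder(self):
        # Lowest sequence number holds the lock.
        return self.nodes[min(self.nodes)] if self.nodes else None

    def release(self, name):
        del self.nodes[name]

store = ZNodeStore()
a = store.create_sequential("writer-A")   # A queues first -> holds lock
b = store.create_sequential("writer-B")   # B queues behind A
print(store.holder())                     # writer-A
store.release(a)                          # A finishes its write
print(store.holder())                     # writer-B
```

Even this toy version shows the cost: every write now needs a round-trip to a coordination service before it can touch a byte.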

* if you want immediate-update writes, you'd need to push the changes 
out to the (existing) nodes, and deal with queueing up pending changes 
for machines that are currently offline, in ways I don't want to think 
about.
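A toy sketch of that immediate-update path: push each change to every replica that is up, and queue it for any replica that is down, draining the backlog when it returns. All names here are invented for illustration:

```python
from collections import defaultdict

class Node:
    """Toy replica holding a key -> value map (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.online = True
        self.data = {}

nodes = [Node("n1"), Node("n2"), Node("n3")]
pending = defaultdict(list)               # node name -> queued changes

def push(key, value):
    for n in nodes:
        if n.online:
            n.data[key] = value                   # immediate update
        else:
            pending[n.name].append((key, value))  # queue for later

def reconnect(node):
    node.online = True
    for key, value in pending.pop(node.name, []):
        node.data[key] = value            # drain the backlog

nodes[2].online = False
push("blk_1", "v2")                       # n3 misses this write
reconnect(nodes[2])                       # backlog drained on return
print(nodes[2].data["blk_1"])             # v2
```

Note the queues have to live somewhere durable and grow without bound while a machine stays down, which is exactly the part nobody wants to think about.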

* if you want slower-update writes (eventual consistency), then things 
may be slightly simpler: you'd still need a lock on writing, but each 
write could be pushed out to the readers with a bit more bandwidth and 
CPU scheduling flexibility. There's still that offline-node problem, 
though. If a node that was down comes back up, how does it know its 
data is out of date, and where does it get the fresh data from? What 
does it do if all the other nodes that have updated data are offline?
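One common answer to the "how does it know its data is out of date" question is to version every block, so a restarted node can compare versions against live peers and pull anything newer. The sketch below is a toy simulation with invented names, not any real HDFS mechanism:

```python
class Replica:
    """Toy replica: each block carries a version number (illustrative)."""
    def __init__(self, name):
        self.name = name
        self.online = True
        self.blocks = {}              # block_id -> (version, data)

    def write(self, block_id, version, data):
        self.blocks[block_id] = (version, data)

    def sync_from(self, peers):
        """On restart, pull any block a live peer has at a newer version."""
        for peer in peers:
            if not peer.online:
                continue
            for bid, (ver, data) in peer.blocks.items():
                mine_ver = self.blocks.get(bid, (-1, None))[0]
                if ver > mine_ver:
                    self.blocks[bid] = (ver, data)

n1, n2, n3 = Replica("n1"), Replica("n2"), Replica("n3")
for n in (n1, n2, n3):
    n.write("blk_1", 1, "v1")

n3.online = False                                  # n3 goes down
n1.write("blk_1", 2, "v2"); n2.write("blk_1", 2, "v2")

n3.online = True                                   # n3 restarts...
n3.sync_from([n1, n2])                             # ...and catches up
print(n3.blocks["blk_1"])                          # (2, 'v2')
```

The last question in the bullet is the one this scheme can't answer: if every peer holding version 2 is also offline, n3 has no one to compare against and cannot even detect that it is stale.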

> I don't understand why Yahoo didn't provide that functionality. And as
> far as I know no one else is working on this. Why is that?

It's because it scares us and we are happier writing code to live in a 
world where you don't seek and patch files, but instead add new data and 
delete old stuff. I don't know what the Cassandra and HBase teams do here.
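That "add new data and delete old stuff" style can be sketched in a few lines: instead of seeking into a file and patching a record in place, append a new version and have readers take the latest one, compacting away the old entries later. This is roughly the log-structured idea versioned stores build on; the toy code below is illustrative, not any real store's API:

```python
records = []                              # append-only log of (key, value)

def put(key, value):
    records.append((key, value))          # never rewrite in place

def get(key):
    # Latest append wins; older entries are garbage to compact later.
    for k, v in reversed(records):
        if k == key:
            return v
    return None

put("user:1", "alice")
put("user:1", "alice-renamed")            # "modify" = append a new version
print(get("user:1"))                      # alice-renamed
```

All writes become the sequential appends HDFS is good at, and "modify" turns into a read-side convention plus background compaction.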

