hadoop-hdfs-user mailing list archives

From Iman E <hadoop_...@yahoo.com>
Subject Re: caching in hdfs?
Date Tue, 21 Jul 2009 00:07:10 GMT
Thank you, Dhruba and Todd, for your replies.

From: Todd Lipcon <todd@cloudera.com>
To: hdfs-user@hadoop.apache.org
Sent: Sunday, July 19, 2009 11:23:02 PM
Subject: Re: caching in hdfs?

In addition to what Dhruba said, there is implicitly some caching going on at the Linux filesystem
level on the DataNodes. If someone has recently read a block on a given datanode, and another
read occurs on the same block, it's likely to be served from the Linux buffer cache, assuming
you have enough RAM free.
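[Editor's note: the OS-level caching described above can be observed without Hadoop at all. A minimal, hypothetical sketch in plain Java: write a file, then time two consecutive full reads. On a machine with free RAM, both passes are typically served from the Linux buffer cache rather than disk (the write itself usually warms the cache), which is the same effect a DataNode benefits from. The class name `PageCacheDemo` is invented for this sketch.]

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class PageCacheDemo {
    // Read the whole file, discarding the bytes, and return elapsed nanoseconds.
    static long timedRead(Path p) throws IOException {
        long start = System.nanoTime();
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(p)) {
            while (in.read(buf) != -1) { /* discard */ }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("cache-demo", ".bin");
        Files.write(p, new byte[32 * 1024 * 1024]); // 32 MB of zeros

        // The file was just written, so even the "first" read is likely
        // already cached; on a cold cache the first read would hit disk.
        long first = timedRead(p);
        long second = timedRead(p); // typically served from the buffer cache
        System.out.println("first read:  " + first / 1_000_000 + " ms");
        System.out.println("second read: " + second / 1_000_000 + " ms");
        Files.delete(p);
    }
}
```

Timings vary by machine, so the numbers printed are illustrative only; the point is that repeated reads of a recently touched block need not reach the disk.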


On Sun, Jul 19, 2009 at 10:50 PM, Dhruba Borthakur <dhruba@gmail.com> wrote:

>I am assuming that you are talking about a map-reduce job. In this case, if you run your
job twice, each mapper will contact the namenode every time the mapper starts.
>If you use FSDataInputStream to read an HDFS file, data is streamed from the datanode(s)
to the client. It is buffered as part of FSDataInputStream. However, if you open the same
file again and get another FSDataInputStream, the buffer of the first stream is not shared
with the buffer associated with the second stream (although they refer to the same HDFS file).
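[Editor's note: the two-streams-don't-share-a-buffer behavior described above can be illustrated with a plain java.io analogy, no Hadoop cluster required. Two separately opened buffered streams over the same file each fill their own buffer and keep their own position, which is the same relationship two FSDataInputStreams have when opened on one HDFS file. The class name `IndependentStreams` is invented for this sketch.]

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class IndependentStreams {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("streams", ".txt");
        Files.write(p, "abcdef".getBytes());

        // Two separately opened streams over the same file: each has its
        // own buffer and its own read position, analogous to two
        // FSDataInputStreams opened on the same HDFS file.
        try (InputStream a = new BufferedInputStream(Files.newInputStream(p));
             InputStream b = new BufferedInputStream(Files.newInputStream(p))) {
            System.out.println((char) a.read()); // prints 'a'
            System.out.println((char) a.read()); // prints 'b'
            // Stream b is unaffected by the reads performed on a:
            System.out.println((char) b.read()); // prints 'a'
        }
        Files.delete(p);
    }
}
```

Any sharing across opens therefore happens below the client, e.g. in the DataNode's OS buffer cache, not in the client-side stream buffers.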
>On Sun, Jul 19, 2009 at 10:11 PM, Iman E <hadoop_ami@yahoo.com> wrote:
>>I would like to know if HDFS does caching by default at the slaves. If I run my job twice,
and assuming that the data is split the same way each time, is the namenode contacted
every time to find the location of these files? Also, is the data read directly from disk every time,
or can it be read from the cache? I am using FSDataInputStream to open the files and read
