hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: why does not hdfs read ahead ?
Date Tue, 24 Nov 2009 18:47:22 GMT

On Nov 24, 2009, at 12:36 PM, Todd Lipcon wrote:

> On Tue, Nov 24, 2009 at 10:33 AM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
>> On Nov 24, 2009, at 12:06 PM, Todd Lipcon wrote:
>>> Also, keep in mind that, when you open a block for reading, the DN
>>> immediately starts writing the entire block (assuming it's requested via the
>>> xceiver protocol) - it's TCP backpressure on the send window that does flow
>>> control there.
>> Ok, that's a pretty freakin' cool idea.  Is it well-documented how this
>> technique works?  How does this affect folks (me) who use the pread
>> interface?
> AFAIK using pread sends the actual length with the OP_READ_BLOCK command, so
> it doesn't read ahead past what you ask for. The awful thing about pread is
> that it actually makes a new datanode connection for every read - including
> the TCP handshake round trip, thread setup/teardown, etc.

I'm not going to argue with the fact that we can do better here, but it's not as bad as you
think for our particular workflow.  Our random reads are "truly random"; i.e., there are approximately
zero repeated requests of data.  Hence, the 1ms of overhead is pretty negligible compared
to spinning a hard drive (10ms when the cluster is idle, 30ms when we're pounding it).
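
To put rough numbers on that, here is a minimal sketch of the arithmetic. The 1ms, 10ms and 30ms figures are the ones quoted above; the function name overhead_fraction is hypothetical, and the model (per-read latency = connection overhead + disk time) is a simplifying assumption, not a measurement.

```python
def overhead_fraction(connect_ms: float, disk_ms: float) -> float:
    """Share of per-read latency spent on connection setup.

    Assumes latency is simply connection overhead plus disk time.
    """
    return connect_ms / (connect_ms + disk_ms)


# Figures from the message: ~1ms connection overhead,
# ~10ms per disk read when idle, ~30ms when the cluster is busy.
idle = overhead_fraction(1.0, 10.0)
busy = overhead_fraction(1.0, 30.0)
print(f"idle: {idle:.1%}, busy: {busy:.1%}")  # idle: 9.1%, busy: 3.2%
```

Under these assumptions the per-read connection cost stays below a tenth of the total, and it shrinks further as the disks get busier.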

In future versions of our software, we've made things at least "monotonically increasing".
 I.e., with a few exceptions, every position is strictly greater than the position of the
last read.  (It doesn't mean we can sequentially read out the file; our reads can be quite
sparse, only taking 10% of the file; if we read things sequentially, we'd overread by a factor
of 10, and that can start to hit network limitations).
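
The overread factor can be sketched with a toy access pattern. Everything here is hypothetical (the file size, the request layout, and the helper names is_monotonic and overread_factor); it only illustrates how reads can be strictly increasing in position while still touching 10% of the file.

```python
def is_monotonic(requests):
    """True if every read starts strictly after the previous one starts."""
    return all(b[0] > a[0] for a, b in zip(requests, requests[1:]))


def overread_factor(requests, file_size):
    """Bytes moved by a full sequential scan divided by bytes actually wanted."""
    wanted = sum(length for _, length in requests)
    return file_size / wanted


# Hypothetical layout: a 1,000,000-byte file where the application
# wants the first 1,000 bytes of every 10,000-byte stretch (10% overall).
file_size = 1_000_000
requests = [(offset, 1_000) for offset in range(0, file_size, 10_000)]

print(is_monotonic(requests))                # True
print(overread_factor(requests, file_size))  # 10.0
```

With this pattern a sequential scan would move ten times the bytes the application needs, which is the kind of overread that can run into network limits.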

At some point, I need to do a talk or write-up of the column-oriented techniques that HEP
folks do; after all, they've been doing column-oriented stores for the past 20 years or so.
 They have some tricks up their sleeves, and it would be interesting to compare notes.


>>> So, although it's not explicitly reading ahead, most of the
>>> reads on DFSInputStream should be coming from the TCP receive buffer, not
>>> making round trips.
>>> At one point a few weeks ago I did hack explicit readahead around
>>> DFSInputStream and didn't see an appreciable difference. I didn't spend much
>>> time on it, though, so I may have screwed something up - wasn't a scientific
>>> test.
>> Speaking as someone who's worked with storage systems that do an explicit
>> readahead, this can turn out to be a big giant disaster if it's combined
>> with random reads.
>> Big disaster as far as application-level throughput goes - but does make
>> for impressive ganglia graphs!
>> Brian
>>> -Todd
>>> On Tue, Nov 24, 2009 at 10:02 AM, Eli Collins <eli@cloudera.com> wrote:
>>>> Hey Martin,
>>>> It would be an interesting experiment but I'm not sure it would
>>>> improve things as the host (and hardware to some extent) are already
>>>> reading ahead. A useful exercise would be to evaluate whether the new
>>>> default host parameters for on-demand readahead are suitable for
>>>> hadoop.
>>>> http://lwn.net/Articles/235164
>>>> http://lwn.net/Articles/235181
>>>> Thanks,
>>>> Eli
>>>> On Mon, Nov 23, 2009 at 11:23 PM, Martin Mituzas <xietao1981@hotmail.com>
>>>> wrote:
>>>>> I read the code and found that the call
>>>>> DFSInputStream.read(buf, off, len)
>>>>> causes the DataNode to read len bytes (or fewer if it encounters the end of
>>>>> the block). Why doesn't HDFS read ahead to improve performance for
>>>>> sequential reads?
>>>>> --
>>>>> View this message in context:
>>>>> http://old.nabble.com/why-does-not-hdfs-read-ahead---tp26491449p26491449.html
>>>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
