hadoop-common-dev mailing list archives

From Da Zheng <zhengda1...@gmail.com>
Subject Re: Hadoop use direct I/O in Linux?
Date Wed, 05 Jan 2011 15:19:23 GMT
On 1/5/11 9:50 AM, Segel, Mike wrote:
> You are mixing a few things up.
> You're testing your I/O using C. 
> What do you see if you try testing your direct I/O from Java?
> I'm guessing that you'll keep your i/o piece in place and wrap it within some JNI code
> and then re-write the test in Java?
I tested both.
> Also are you testing large streams or random i/o blocks? (Hopefully both)
I only tested large streams. For MapReduce, the only random I/O access happens
between mapping and reducing, right? That is where the output from the mappers is
sorted, spilled to disk, and then merge-sorted. The reducers then need another
merge sort after they pull the data. None of these operations is completely
random. There may be some random access for metadata, but it should be small.
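To make the access pattern concrete, here is a minimal sketch in Python (not
Hadoop's actual code) of merging sorted spill runs. The names read_run and
merge_spills, and the in-memory lists standing in for spill files, are
assumptions made only for illustration. It logs the order in which each run is
read, which shows that every run is consumed front to back while the merge
only switches between runs.

```python
import heapq


def read_run(run_id, run, access_log):
    """Yield keys from one sorted run, recording (run_id, offset) for each read."""
    for offset, key in enumerate(run):
        access_log.append((run_id, offset))
        yield key


def merge_spills(runs):
    """Merge already-sorted runs; return the merged keys and the read log."""
    access_log = []
    readers = [read_run(i, run, access_log) for i, run in enumerate(runs)]
    merged = list(heapq.merge(*readers))
    return merged, access_log
```

Filtering the log by run ID gives strictly increasing offsets for each run, so
with one read buffer per run the disk sees a handful of sequential streams
rather than scattered small reads.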
> I think that when you test out the system, you'll find that you won't see much, if any
> performance improvement.
> -----Original Message-----
> From: Da Zheng [mailto:zhengda@cs.jhu.edu] 
> Sent: Tuesday, January 04, 2011 11:11 PM
> To: common-dev@hadoop.apache.org
> Subject: Re: Hadoop use direct I/O in Linux?
> On 1/4/11 5:17 PM, Christopher Smith wrote:
>> If you use direct I/O to reduce CPU time, that means you are saving CPU via
>> DMA. If you are using Java's heap though, you can kiss that goodbye.
> The buffer for direct I/O cannot be allocated from Java's heap anyway, I don't
> understand what you mean?
>> That said, I'm surprised that the Atom can't keep up with magnetic disk
>> unless you have a striped array. 100MB/s shouldn't be too taxing. Is it
>> possible you're doing something wrong or your CPU is otherwise occupied?
> Yes, my C program can reach 100MB/s or even 110MB/s when writing data to the
> disk sequentially without direct I/O, but with direct I/O enabled, the maximal
> throughput is about 140MB/s. But the biggest difference is CPU usage.
> Without direct I/O, the operating system uses a lot of CPU time (the data below
> was obtained with top, and this is a dual-core processor with hyper-threading enabled).
> Cpu(s):  3.4%us, 32.8%sy,  0.0%ni, 50.0%id, 12.1%wa,  0.0%hi,  1.6%si,  0.0%st
> But with direct I/O, the system time can be as little as 3%.
> Best,
> Da
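For anyone who wants to try a comparison like the one quoted above, here is a
rough sketch. It is in Python, not the C program behind the quoted numbers, and
it makes no claim about what results it will give. The names open_for_write,
write_sequential and BLOCK are hypothetical. It assumes Linux (where
os.O_DIRECT exists) and a write size that is a multiple of the device's logical
block size. The buffer comes from an anonymous mmap, which is page-aligned and
lives outside the ordinary object heap, since direct I/O needs an aligned
buffer. If the filesystem rejects O_DIRECT at open time, the sketch falls back
to buffered writes and reports that it did so.

```python
import errno
import mmap
import os
import resource
import time

BLOCK = 1 << 20  # 1 MiB per write; assumed to be a multiple of the device block size


def open_for_write(path, direct):
    """Open path for writing; return (fd, used_direct)."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if direct and hasattr(os, "O_DIRECT"):
        try:
            return os.open(path, flags | os.O_DIRECT, 0o600), True
        except OSError as exc:
            # Some filesystems reject O_DIRECT at open time; fall back below.
            if exc.errno != errno.EINVAL:
                raise
    return os.open(path, flags, 0o600), False


def write_sequential(path, total_bytes, direct):
    """Write total_bytes (a multiple of BLOCK) sequentially.

    Returns (bytes_written, used_direct, wall_seconds, system_cpu_seconds).
    """
    buf = mmap.mmap(-1, BLOCK)  # anonymous mapping: page-aligned, off-heap
    buf.write(b"x" * BLOCK)
    fd, used_direct = open_for_write(path, direct)
    usage_before = resource.getrusage(resource.RUSAGE_SELF)
    wall_before = time.monotonic()
    written = 0
    try:
        while written < total_bytes:
            written += os.write(fd, buf)
        os.fsync(fd)  # count flush time so buffered writes are not flattered
    finally:
        os.close(fd)
        buf.close()
    wall = time.monotonic() - wall_before
    usage_after = resource.getrusage(resource.RUSAGE_SELF)
    return written, used_direct, wall, usage_after.ru_stime - usage_before.ru_stime
```

Calling write_sequential(path, 1024 * BLOCK, direct=True) and again with
direct=False on the same disk gives two throughput and system-time figures to
compare. Note that ru_stime only covers system time charged to this process,
while the top output above is system-wide, so the two are not directly
comparable.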
>> On Tue, Jan 4, 2011 at 9:58 AM, Da Zheng <zhengda@cs.jhu.edu> wrote:
>>> The most important reason for me to use direct I/O is that the Atom
>>> processor is too weak. If I write a simple program to write data to the
>>> disk, CPU usage is almost 100% but the disk hasn't reached its maximal bandwidth.
>>> When I write data to an SSD, the difference is even larger. Even when the program
>>> has saturated the two cores of the CPU, it cannot even reach half of
>>> the maximal bandwidth of the SSD.
>>> I don't know how much benefit direct I/O can bring to a normal processor
>>> such as a Xeon, but I have a feeling I have to use direct I/O in order to get
>>> good performance on Atom processors.
>>> Best,
>>> Da
