hadoop-common-dev mailing list archives

From Da Zheng <zhengda1...@gmail.com>
Subject Re: Hadoop use direct I/O in Linux?
Date Wed, 05 Jan 2011 15:07:29 GMT
On 1/5/11 12:44 AM, Christopher Smith wrote:
> On Tue, Jan 4, 2011 at 9:11 PM, Da Zheng <zhengda@cs.jhu.edu> wrote:
>> On 1/4/11 5:17 PM, Christopher Smith wrote:
>>> If you use direct I/O to reduce CPU time, that means you are saving CPU via
>>> DMA. If you are using Java's heap though, you can kiss that goodbye.
>> The buffer for direct I/O cannot be allocated from Java's heap anyway, so I
>> don't understand what you mean.
> The DMA buffer cannot be on Java's heap, but in the typical use case (say
> Hadoop), it would certainly have to get copied either into or out of Java's
> heap, and that's going to get the CPU involved whether you like it or not.
> If you stay entirely off the Java heap, you really don't get to use much of
> Java's object model or capabilities, so you have to wonder why use Java in
> the first place.
True. I wrote the code with JNI, and found it still comes very close to its
best performance even when doing one or even two memory copies.
>>> That said, I'm surprised that the Atom can't keep up with magnetic disk
>>> unless you have a striped array. 100MB/s shouldn't be too taxing. Is it
>>> possible you're doing something wrong or your CPU is otherwise occupied?
>> Yes, my C program can reach 100MB/s or even 110MB/s when writing data to the
>> disk sequentially, and with direct I/O enabled, the maximum throughput is
>> about 140MB/s. But the biggest difference is CPU usage.
>> Without direct I/O, the operating system uses a lot of CPU time (the data
>> below was collected with top, on a dual-core processor with hyperthreading
>> enabled).
>> Cpu(s):  3.4%us, 32.8%sy,  0.0%ni, 50.0%id, 12.1%wa,  0.0%hi,  1.6%si,
>>  0.0%st
>> But with direct I/O, the system time can be as little as 3%.
> I'm surprised that system time is really that high. We did Atom experiments
> where it wasn't even close to that. Are you using a memory mapped file? If
No, I don't. I just fill a large buffer in memory and write it to the disk;
the code is attached below. Right now the buffer size is 1MB, which I think is
big enough to get the best performance.
> not, are you buffering your writes? Is there perhaps
> something dysfunctional about the drive controller/driver you are using?
I'm not sure. It's also odd to me, but I thought it was all I could get from
an Atom processor. I guess I need to do some profiling.
Also, which Atom processors did you use? Do you have hyperthreading enabled?


#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#define bufsize (1024 * 1024)   /* 1MB write buffer */

static void *buf;
static long tot_size;
static time_t start_time;

/* On Ctrl-C, print the average rate over the whole run and exit. */
static void sighandler (int sig)
{
    time_t end_time = time (NULL);
    if (end_time > start_time)
        printf ("average rate: %ld\n", tot_size / (end_time - start_time));
    exit (0);
}

/* Fill the buffer with a simple pattern so the writes carry real data. */
static void fill_data (int *data, size_t size)
{
    size_t i;
    for (i = 0; i < size / sizeof (int); i++)
        data[i] = (int) i;
}

int main (int argc, char *argv[])
{
    char *out_file;
    int outfd;
    ssize_t size;
    time_t start_time2;
    long size1 = 0;

    out_file = argv[1];

    outfd = open (out_file, O_CREAT | O_WRONLY, S_IWUSR | S_IRUSR);
    if (outfd < 0) {
        perror ("open");
        return -1;
    }

    buf = mmap (0, bufsize, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror ("mmap");
        return -1;
    }

    start_time2 = start_time = time (NULL);
    signal (SIGINT, sighandler);
    int offset = 0;

    while (1) {
        fill_data ((int *) buf, bufsize);
        size = write (outfd, buf, bufsize);
        if (size < 0) {
            perror ("write");
            return 1;
        }
        offset += size;
        tot_size += size;
        size1 += size;
//      if (posix_fadvise (outfd, 0, offset, POSIX_FADV_NOREUSE) < 0)
//          perror ("posix_fadvise");

        time_t end_time = time (NULL);
        if (end_time - start_time2 > 5) {
            printf ("current rate: %ld\n",
                    (long) (size1 / (end_time - start_time2)));
            size1 = 0;
            start_time2 = end_time;
        }
    }
}