hadoop-common-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3205) Read multiple chunks directly from FSInputChecker subclass into user buffers
Date Wed, 02 Dec 2009 07:54:20 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784691#action_12784691 ]

Todd Lipcon commented on HADOOP-3205:

Circled around on this issue tonight and tried to look into the mysterious behavior with the
value of MAX_CHUNKS (the constant that determines how many checksum chunks' worth of data
we'll read in a single go).

I wrote a quick benchmark which read a 1GB file out of /dev/shm with checksums using different
values of MAX_CHUNKS. For each value, I ran 50 separate trials and calculated the average,
as well as doing t-tests to figure out which results were within noise of each other.
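
As a rough illustration of the comparison described above, here is a minimal sketch of
averaging per-trial timings and computing a t statistic for two MAX_CHUNKS settings. The
comment does not say which t-test was used; Welch's unequal-variance form is an assumption,
and the class name, method names, and sample timings are made up for illustration.

```java
import java.util.Arrays;

public class TrialStats {
    /** Arithmetic mean of the samples. */
    public static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    /** Unbiased sample variance (divides by n - 1). */
    public static double variance(double[] xs) {
        double m = mean(xs);
        double sum = 0.0;
        for (double x : xs) {
            sum += (x - m) * (x - m);
        }
        return sum / (xs.length - 1);
    }

    /** Welch's t statistic for two independent samples (assumed test, see lead-in). */
    public static double welchT(double[] a, double[] b) {
        double standardError =
            Math.sqrt(variance(a) / a.length + variance(b) / b.length);
        return (mean(a) - mean(b)) / standardError;
    }

    public static void main(String[] args) {
        // Made-up timings in seconds, NOT taken from the benchmark.
        double[] settingA = {3.10, 3.12, 3.09, 3.11};
        double[] settingB = {3.68, 3.70, 3.67, 3.69};
        System.out.printf("mean A=%.4f mean B=%.4f t=%.2f%n",
            mean(settingA), mean(settingB), welchT(settingA, settingB));
    }
}
```

A t statistic near zero suggests the two settings are within noise of each other; turning it
into a p-value additionally needs the t distribution, which this sketch leaves out.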

../hadoop-3205-bench/mc_64.user     3.0954
../hadoop-3205-bench/mc_128.user    3.1036
../hadoop-3205-bench/mc_8.user      3.1054
../hadoop-3205-bench/mc_256.user    3.1104
../hadoop-3205-bench/mc_32.user     3.1156 ** everything below here is within noise
../hadoop-3205-bench/mc_16.user     3.1214
../hadoop-3205-bench/mc_4.user      3.2896
../hadoop-3205-bench/mc_2.user      3.427
../hadoop-3205-bench/mc_1.user      3.6832

../hadoop-3205-bench/mc_16.elapsed     3.423
../hadoop-3205-bench/mc_64.elapsed     3.425
../hadoop-3205-bench/mc_8.elapsed      3.4288
../hadoop-3205-bench/mc_256.elapsed    3.4294
../hadoop-3205-bench/mc_128.elapsed    3.434
../hadoop-3205-bench/mc_32.elapsed     3.4392 ** everything below here is within noise
../hadoop-3205-bench/mc_4.elapsed      3.6108
../hadoop-3205-bench/mc_2.elapsed      3.7032
../hadoop-3205-bench/mc_1.elapsed      3.9846

These were all done with a 64KB io.file.buffer.size, which would make us expect an optimal
value of 128, since it should eliminate a copy. The results show that there are no gains to
be had beyond reading 16 or 32 chunks at a time (8-16KB). The L1 cache on this machine is
128K, so that's not the magic number either.
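
The arithmetic behind those figures can be checked with a short sketch, assuming 512 bytes of
data per checksum chunk (the figure given in the issue description). The class and method
names here are hypothetical and not part of the patch.

```java
public class ChunkMath {
    // Assumed data bytes covered by one checksum chunk (512, per the issue description).
    public static final int BYTES_PER_CHUNK = 512;

    /** Number of whole chunks that fit in a buffer of the given size. */
    public static int chunksPerBuffer(int bufferSizeBytes) {
        return bufferSizeBytes / BYTES_PER_CHUNK;
    }

    /** Bytes of data fetched by one read of maxChunks chunks. */
    public static int bytesPerRead(int maxChunks) {
        return maxChunks * BYTES_PER_CHUNK;
    }

    public static void main(String[] args) {
        System.out.println(chunksPerBuffer(64 * 1024)); // 128 chunks fill a 64KB buffer
        System.out.println(bytesPerRead(16));           // 8192 bytes  (8KB)
        System.out.println(bytesPerRead(32));           // 16384 bytes (16KB)
    }
}
```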

So basically, the performance improvement here remains a mystery to me, but it's clear there
is one - about 13% for reading out of RAM on the machine above. Given these results, I'd propose
hard coding MAX_CHUNKS to 32 rather than basing it on io.file.buffer.size as I earlier figured.

On a separate note, some review responses:

bq. Was this way before your change but the back-to-back if statements on lines 252-253 could
be combined trivially.

I think you missed the "chunkPos += read;" outside the inner if? Java seems to occasionally
return -1 for EOF for some reason so I was nervous about letting that happen outside the if.
I'd be happy to add an assert read >= 0 though for this case and make it part of the contract
of readChunks to never return negative.
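
To make the proposed contract concrete, here is a minimal sketch, not the code from the patch.
It assumes a readChunks that maps the -1 that java.io.InputStream.read returns at end of stream
to 0, so the caller can assert read >= 0 and advance chunkPos unconditionally. The class
ReadChunksSketch, its fields, and its method signatures are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ReadChunksSketch {
    private final InputStream in;
    private long chunkPos = 0; // bytes consumed so far (name borrowed from the comment)

    public ReadChunksSketch(InputStream in) {
        this.in = in;
    }

    /** Hypothetical readChunks: returns the bytes read, 0 at EOF, never negative. */
    int readChunks(byte[] buf, int off, int len) {
        try {
            int n = in.read(buf, off, len); // InputStream.read returns -1 at end of stream
            return Math.max(n, 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Caller side: the position update is safe because read is never negative. */
    public int read(byte[] buf, int off, int len) {
        int read = readChunks(buf, off, len);
        assert read >= 0 : "readChunks must never return a negative count";
        chunkPos += read;
        return read;
    }

    public long getChunkPos() {
        return chunkPos;
    }

    public static void main(String[] args) {
        ReadChunksSketch s = new ReadChunksSketch(new ByteArrayInputStream(new byte[10]));
        byte[] buf = new byte[8];
        System.out.println(s.read(buf, 0, 8)); // 8
        System.out.println(s.read(buf, 0, 8)); // 2
        System.out.println(s.read(buf, 0, 8)); // 0 at EOF, not -1
        System.out.println(s.getChunkPos());   // 10
    }
}
```

Note that Java assertions are disabled unless the JVM is started with -ea, so the assert
documents the contract rather than enforcing it in production.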

The rest of the review makes sense, and I'll address those things and upload a new patch.

> Read multiple chunks directly from FSInputChecker subclass into user buffers
> ----------------------------------------------------------------------------
>                 Key: HADOOP-3205
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3205
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>            Reporter: Raghu Angadi
>            Assignee: Todd Lipcon
>         Attachments: hadoop-3205.txt, hadoop-3205.txt, hadoop-3205.txt, hadoop-3205.txt
> Implementations of FSInputChecker and FSOutputSummer like DFS do not have access to full
> user buffer. At any time DFS can access only up to 512 bytes even though user usually reads
> with a much larger buffer (often controlled by io.file.buffer.size). This requires implementations
> to double buffer data if an implementation wants to read or write larger chunks of data from
> underlying storage.
> We could separate changes for FSInputChecker and FSOutputSummer into two separate jiras.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
