hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HDFS-755) Read multiple checksum chunks at once in DFSInputStream
Date Wed, 11 Nov 2009 07:46:39 GMT

     [ https://issues.apache.org/jira/browse/HDFS-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-755:

    Attachment: hdfs-755.txt

Here's a fairly small patch which uses the support for reading multiple checksum chunks from
HADOOP-3205. I haven't run the full test suite yet, but got about halfway through and it seems
to work - I'll be sure to put it through full testing before it gets committed. I'll also
run this on a cluster and get TestDFSIO throughput numbers.
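
For readers unfamiliar with the idea, here is a rough sketch of what handling several checksum chunks in one call can look like. It is not the code in hdfs-755.txt and does not use the FSInputChecker API. The class name, method names, and the 512-byte chunk size are all assumptions made for illustration, and CRC32 from the JDK stands in for whatever checksum the stream is configured with.

```java
import java.util.zip.CRC32;

// Hypothetical illustration only: not taken from the HDFS-755 patch.
final class MultiChunkChecksumSketch {

    // Assumed chunk size for this example.
    static final int BYTES_PER_CHECKSUM = 512;

    /** Computes one CRC32 value per chunk of data[0..len). The last chunk may be short. */
    static long[] checksumChunks(byte[] data, int len, int bytesPerChunk) {
        int numChunks = (len + bytesPerChunk - 1) / bytesPerChunk;
        long[] sums = new long[numChunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < numChunks; i++) {
            int off = i * bytesPerChunk;
            int chunkLen = Math.min(bytesPerChunk, len - off);
            crc.reset();
            crc.update(data, off, chunkLen);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    /**
     * Verifies every chunk of one large buffer in a single call.
     * Returns the index of the first chunk whose checksum differs, or -1 if all match.
     */
    static int verifyChunks(byte[] data, int len, int bytesPerChunk, long[] expected) {
        long[] actual = checksumChunks(data, len, bytesPerChunk);
        if (actual.length != expected.length) {
            throw new IllegalArgumentException("checksum count does not match chunk count");
        }
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != expected[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = new byte[4 * BYTES_PER_CHECKSUM + 100]; // spans five chunks
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31);
        }
        long[] sums = checksumChunks(data, data.length, BYTES_PER_CHECKSUM);
        data[BYTES_PER_CHECKSUM + 7] ^= 0x01; // flip one bit inside the second chunk
        System.out.println("first bad chunk index: "
            + verifyChunks(data, data.length, BYTES_PER_CHECKSUM, sums));
    }
}
```

The point of the sketch is only that one call covers a whole buffer of chunks, so the per-call overhead is paid once per buffer rather than once per chunk.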

Performance results look to be in line with what we see in HADOOP-3205.

Benchmark setup:
  - I put a 700MB file on a pseudo-distributed HDFS cluster.
  - I did 30 "fs -cat" of this file without the patch applied, and 30 with it applied. In
both cases I did a couple cats first to make sure it was in the buffer cache. I can run another
set of benchmarks that drops cache in between runs if people would like.
  - In both benchmark cases, the patch from HADOOP-3205 was applied. I used a 64K io.file.buffer.size
for both the DN and the client.

T-test results (alternative hypothesis = "with patch is faster"):
Wall clock time: p-value = 2.644e-07 -> effectively 100% confidence. 95% confidence interval of 3.4% speedup
User time: p-value = 1.638e-10 -> effectively 100% confidence. 95% confidence interval of 3.9% speedup
Sys time: p-value = 0.982 - that is to say, above 95% confidence that we *slowed down* sys
time. The 95% confidence interval is about 0.7%.
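
For anyone who wants to reproduce this kind of comparison, below is a minimal sketch of the statistic behind a two-sample t-test. It assumes the unequal-variance (Welch) form, which is an assumption on my part since the message does not say which variant was used. The class name is hypothetical and the timings in main are made-up placeholders, not the measurements from this benchmark. It computes only the t statistic and the Welch-Satterthwaite degrees of freedom; turning those into a p-value needs a t-distribution CDF from a statistics package.

```java
// Hypothetical illustration only: sample values below are invented, not benchmark data.
final class WelchTTestSketch {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    /** Unbiased sample variance (divides by n - 1). */
    static double sampleVariance(double[] x) {
        double m = mean(x);
        double ss = 0.0;
        for (double v : x) {
            ss += (v - m) * (v - m);
        }
        return ss / (x.length - 1);
    }

    /** Welch t statistic for mean(a) - mean(b); positive when a is larger on average. */
    static double tStatistic(double[] a, double[] b) {
        double va = sampleVariance(a) / a.length;
        double vb = sampleVariance(b) / b.length;
        return (mean(a) - mean(b)) / Math.sqrt(va + vb);
    }

    /** Welch-Satterthwaite approximation to the degrees of freedom. */
    static double degreesOfFreedom(double[] a, double[] b) {
        double va = sampleVariance(a) / a.length;
        double vb = sampleVariance(b) / b.length;
        double numerator = (va + vb) * (va + vb);
        double denominator = va * va / (a.length - 1) + vb * vb / (b.length - 1);
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Placeholder wall clock times in seconds (invented for the example).
        double[] withoutPatch = {4.10, 4.05, 4.20, 4.15, 4.08};
        double[] withPatch = {3.95, 3.90, 4.02, 3.98, 3.93};
        System.out.println("t  = " + tStatistic(withoutPatch, withPatch));
        System.out.println("df = " + degreesOfFreedom(withoutPatch, withPatch));
    }
}
```

With "with patch is faster" as the alternative hypothesis, a large positive t (unpatched times minus patched times) corresponds to a small one-sided p-value.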

The 95% confidence intervals in this benchmark are less impressive sounding than the ones
in HADOOP-3205 because I used fewer samples.

As to why the sys time slowed down, it's a bit of a mystery. My best guess is that, since
we're now reading from the network sockets in larger chunks, we occasionally block in the
kernel where we used to pretty much always read from a full buffer. But, this isn't too concerning
- the wall clock time is what really matters.

> Read multiple checksum chunks at once in DFSInputStream
> -------------------------------------------------------
>                 Key: HDFS-755
>                 URL: https://issues.apache.org/jira/browse/HDFS-755
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs client
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-755.txt
>
> HADOOP-3205 adds the ability for FSInputChecker subclasses to read multiple checksum
> chunks in a single call to readChunk. This is the HDFS-side use of that new feature.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
