hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-347) DFS read performance suboptimal when client co-located on nodes with data
Date Sat, 12 Jan 2013 01:57:11 GMT

    [ https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551755#comment-13551755 ]

Todd Lipcon commented on HDFS-347:
----------------------------------

I believe all of the component pieces have now been committed to the HDFS-347 branch. I ran
a number of benchmarks yesterday on the branch in progress, and just re-confirmed the results
from the code committed in SVN. Here's a report of the benchmarks and results:

h1. Benchmarks

To validate the branch, I ran a series of before/after benchmarks, focused specifically on
random read. In particular, I ran benchmarks based on TestParallelRead, which has several
variants that run the same workload through the different read paths.

On the trunk ("before") branch, I ran TestParallelRead (normal read path) and TestParallelLocalRead
(read path based on HDFS-2246). On the HDFS-347 branch, I ran TestParallelRead (normal read
path) and TestParallelShortCircuitRead (new short-circuit path).

I made the following modifications to the test cases so that they act as a better benchmark
(a rough sketch of these tweaks follows the list):

1) Modified to 0% PROPORTION_NON_READ:

Without this modification, I found that both the 'before' and 'after' tests became lock-bound,
since the 'seek-and-read' workload holds a lock on the DFSInputStream, which obscured the
actual performance differences between the data paths.

2) Modified to 30,000 iterations

I simply jacked up the number of iterations to get more reproducible results and to ensure
that the JIT had plenty of time to kick in (with this change the benchmarks ran for ~50 seconds
each instead of only ~5 seconds).

3) Added a variation which has two target blocks

I suspected there could be a regression for workloads which frequently switch back and forth
between two different blocks of the same file. This variation is the same test, but with the
DFS block size set to 128KB, so that the 256KB test file is split into two equal-sized blocks.
This causes a good percentage of the random reads to span block boundaries, and ensures that
the various caches in the code still work correctly when moving between different blocks.
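
For concreteness, here's a rough sketch of the three tweaks expressed as constants. Aside from
PROPORTION_NON_READ (the knob the test already exposes), the class and field names are made up
for illustration and are not the actual TestParallelRead internals:

{code}
// Rough sketch of the benchmark tweaks described above; only PROPORTION_NON_READ
// is a real name from the test, the rest is illustrative.
public final class ParallelReadBenchmarkTweaks {
  // 1) 0% non-read operations, so worker threads don't serialize on the
  //    DFSInputStream lock held by the seek-and-read workload.
  static final int PROPORTION_NON_READ = 0;

  // 2) Many more iterations, so the JIT warms up and each run lasts
  //    ~50 seconds instead of ~5.
  static final int NUM_ITERATIONS = 30000;

  // 3) Two-block variant: a 128 KB block size splits the 256 KB test file
  //    into two equal blocks, so many random reads straddle a boundary.
  static final long DFS_BLOCK_SIZE = 128 * 1024;
  static final int TEST_FILE_SIZE = 256 * 1024;

  private ParallelReadBenchmarkTweaks() {}
}
{code}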

h2. Comparing non-local read

When the new code path is disabled, or when the DN is not local, we continue to use the existing
code path. We expect this code path's performance to be unaffected.
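
For reference, here's a minimal client-side sketch of how the fast path gets switched on or off.
The dfs.client.read.shortcircuit property is the short-circuit toggle as I understand it, and the
file path is just a placeholder:

{code}
// Minimal sketch: forcing the ordinary DataNode streaming path from the client.
// The property name and the test path are illustrative, not taken from the branch.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadPathToggle {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // false => always use the existing streaming read path (the rows below);
    // true  => allow the local short-circuit path when the DN is local.
    conf.setBoolean("dfs.client.read.shortcircuit", false);
    FileSystem fs = FileSystem.get(conf);
    try (FSDataInputStream in = fs.open(new Path("/benchmark/testfile"))) {
      byte[] buf = new byte[64 * 1024];
      in.readFully(0L, buf); // positional read, like the benchmark workload
    }
  }
}
{code}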

Results:

|| Test || #Threads || #Files || Trunk MB/sec || HDFS-347 MB/sec ||
|| TestParallelRead | 4  | 1 | 428.4 | 423.0 |
|| TestParallelRead | 16 | 1 | 669.5 | 651.1 |
|| TestParallelRead | 8  | 2 | 603.4 | 582.7 |
|| TestParallelRead 2-blocks | 4 | 1  | 354.0 | 345.9 |
|| TestParallelRead 2-blocks | 16 | 1 | 534.9 | 520.0 |
|| TestParallelRead 2-blocks | 8 | 2  | 483.1 |  460.8 |

The above numbers seem to show a 2-4% regression, but I think it's within the noise on my
machine (other software was running, etc.). Colin also has one or two ideas for micro-optimizations
which might win back a couple of percent here and there, if it's not just noise.

To put this in perspective, here are results for the same test against branch-1:

|| Test || #Threads || #Files || Branch-1 MB/sec ||
|| TestParallelRead | 4  | 1 | 229.7 |
|| TestParallelRead | 16 | 1 | 264.4 |
|| TestParallelRead | 8  | 2 | 260.1 |

(so trunk is 2-3x as fast as branch-1)

h2. Comparing local read

Here we expect the performance to be as good as or better than the old (HDFS-2246) implementation.
Results:

|| Test || #Threads || #Files || Trunk MB/sec || HDFS-347 MB/sec ||
|| TestParallelLocalRead | 4  | 1 | 901.4  | 1033.6 |
|| TestParallelLocalRead | 16 | 1 | 1079.8 | 1203.9 |
|| TestParallelLocalRead | 8  | 2 | 1087.4 | 1274.0 |
|| TestParallelLocalRead 2-blocks | 4  | 1 | 856.6  | 919.2 |
|| TestParallelLocalRead 2-blocks | 16 | 1 | 1045.8 |  1137.0 |
|| TestParallelLocalRead 2-blocks | 8  | 2 | 966.7  |  1392.9 |

These results show that the new implementation is indeed between 10% and 44% faster than the
HDFS-2246 implementation. Our theory is that the old implementation cached block paths, but
not open file descriptors. Because every positional read creates a new BlockReader, it had to
issue a new {{open()}} syscall for each read, even when the location was cached.
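
To make that concrete, here's an illustrative sketch (made-up names, not the BlockReader code
on either branch) of the difference between caching only the block path and caching the open
descriptor:

{code}
// Illustrative only: why a path cache still pays an open() per positional
// read, while a descriptor cache does not.
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class LocalBlockReadSketch {
  // Path cache: remember where the block file lives on local disk...
  private final Map<Long, String> pathByBlock = new ConcurrentHashMap<Long, String>();
  // Descriptor cache: ...or keep the block file open for reuse.
  private final Map<Long, FileInputStream> fdByBlock =
      new ConcurrentHashMap<Long, FileInputStream>();

  int preadWithPathCache(long blockId, ByteBuffer buf, long offset) throws IOException {
    // A fresh reader per positional read means an open()/close() pair every
    // time, even though the path lookup itself was cached.
    FileInputStream in = new FileInputStream(pathByBlock.get(blockId));
    try {
      return in.getChannel().read(buf, offset);
    } finally {
      in.close();
    }
  }

  int preadWithFdCache(long blockId, ByteBuffer buf, long offset) throws IOException {
    // The descriptor is already open; a positional channel read doesn't move
    // the shared file position, so the cached stream can be reused safely.
    return fdByBlock.get(blockId).getChannel().read(buf, offset);
  }
}
{code}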

h2. Comparing sequential read

I used the BenchmarkThroughput tool, configured to write a 1GB file and then read it back
100 times. This ensures the file is in the buffer cache, so that we're benchmarking CPU overhead
(the actual disk access didn't change in the patch, and we're looking for a potential regression
in CPU usage). I recorded the MB/sec rate for short-circuit reads before and after the patch,
then loaded the data into R and ran a t-test:

{code}
> d.before <- read.table(file="/tmp/before-patch.txt")
> d.after <- read.table(file="/tmp/after-patch.txt")
> t.test(d.before, d.after)

        Welch Two Sample t-test

data:  d.before and d.after 
t = 0.5936, df = 199.777, p-value = 0.5535
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -62.39975 116.14431 
sample estimates:
mean of x mean of y 
 2939.456  2912.584 
{code}

The p-value 0.55 means that there's no statistically significant difference in the performance
of the two data paths for sequential workloads.

I did the same thing with short-circuit disabled and got the following t-test results for
the RemoteBlockReader code path:

{code}
> d.before <- read.table(file="/tmp/before-patch-rbr.txt")
> d.after <- read.table(file="/tmp/after-patch-rbr.txt")
> t.test(d.before, d.after)

        Welch Two Sample t-test

data:  d.before and d.after 
t = 1.155, df = 199.89, p-value = 0.2495
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -18.69172  71.54320 
sample estimates:
mean of x mean of y 
 1454.653  1428.228 
{code}

Again, the p-value of 0.25 means there's no statistically significant difference in performance.

h2. Summary

The patch provides a good speedup (up to ~44% in one case) for some random-read workloads,
and has no discernible negative impact on others.

                
> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>
>                 Key: HDFS-347
>                 URL: https://issues.apache.org/jira/browse/HDFS-347
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, hdfs-client, performance
>            Reporter: George Porter
>            Assignee: Colin Patrick McCabe
>         Attachments: all.tsv, BlockReaderLocal1.txt, HADOOP-4801.1.patch, HADOOP-4801.2.patch,
> HADOOP-4801.3.patch, HDFS-347-016_cleaned.patch, HDFS-347.016.patch, HDFS-347.017.clean.patch,
> HDFS-347.017.patch, HDFS-347.018.clean.patch, HDFS-347.018.patch2, HDFS-347.019.patch, HDFS-347.020.patch,
> HDFS-347.021.patch, HDFS-347.022.patch, HDFS-347.024.patch, HDFS-347.025.patch, HDFS-347.026.patch,
> HDFS-347.027.patch, HDFS-347.029.patch, HDFS-347.030.patch, HDFS-347.033.patch, HDFS-347.035.patch,
> HDFS-347-branch-20-append.txt, hdfs-347.png, hdfs-347.txt, local-reads-doc
>
>
> One of the major strategies Hadoop uses to get scalable data processing is to move the
> code to the data.  However, putting the DFS client on the same physical node as the data
> blocks it acts on doesn't improve read performance as much as expected.
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem is due
> to the HDFS streaming protocol causing many more read I/O operations (iops) than necessary.
> Consider the case of a DFSClient fetching a 64 MB disk block from the DataNode process
> (running in a separate JVM) on the same machine.  The DataNode will satisfy the single
> disk block request by sending data back to the HDFS client in 64-KB chunks.  In
> BlockSender.java, this is done in the sendChunk() method, relying on Java's transferTo()
> method.  Depending on the host O/S and JVM implementation, transferTo() is implemented as
> either a sendfilev() syscall or a pair of mmap() and write().  In either case, each chunk
> is read from the disk by issuing a separate I/O operation.  The result is that the single
> request for a 64-MB block ends up hitting the disk as over a thousand smaller requests of
> 64 KB each.
> Since the DFSClient runs in a different JVM and process from the DataNode, shuttling
> data from the disk to the DFSClient also results in context switches each time network
> packets get sent (in this case, the 64-KB chunk turns into a large number of 1500-byte
> packet send operations).  Thus we see a large number of context switches for each block
> send operation.
> I'd like to get some feedback on the best way to address this, but I think the answer is
> to provide a mechanism for a DFSClient to directly open data blocks that happen to be on
> the same machine.  It could do this by examining the set of LocatedBlocks returned by the
> NameNode, marking those that should be resident on the local host.  Since the DataNode and
> DFSClient (probably) share the same Hadoop configuration, the DFSClient should be able to
> find the files holding the block data, and it could directly open them and send the data
> back to the client.  This would avoid the context switches imposed by the network layer,
> and would allow for much larger read buffers than 64 KB, which should reduce the number of
> iops imposed by each block read operation.
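
As an aside, here's a minimal sketch of the chunked transferTo() loop described above. It is
not the actual BlockSender.sendChunk() code; only the 64 KB chunk size comes from the
description, the rest is made up for illustration:

{code}
// Illustrative sketch of the chunked transferTo() pattern described above:
// a large block pushed to the client 64 KB at a time, where each call may
// become its own sendfile()/mmap+write underneath.
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

class ChunkedBlockSendSketch {
  static final int CHUNK_SIZE = 64 * 1024; // 64 KB per sendChunk()-style call

  static void send(FileInputStream blockFile, WritableByteChannel out) throws IOException {
    FileChannel ch = blockFile.getChannel();
    long pos = 0;
    long len = ch.size();
    while (pos < len) {
      // One transfer per chunk: a 64 MB block becomes ~1000 separate requests.
      pos += ch.transferTo(pos, Math.min(CHUNK_SIZE, len - pos), out);
    }
  }
}
{code}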

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
