hadoop-hdfs-issues mailing list archives

From "Naredula Janardhana Reddy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-959) Performance improvements to DFSClient and DataNode for faster DFS write at replication factor of 1
Date Wed, 10 Feb 2010 12:08:28 GMT

    [ https://issues.apache.org/jira/browse/HDFS-959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831965#action_12831965 ]

Naredula Janardhana Reddy commented on HDFS-959:
------------------------------------------------

bq. In my mind, the ack signifies that the packet has been successfully written.
  Regarding Concurrency of Disk Writes, the assumption I made is that intermediate acks need
not wait for the disk write, but the final ack should be sent only after the entire block
of data has been written to the OS. This appears safe, and I see a similar early ack implemented
in trunk; however, since trunk does this sequentially (in one thread), at most one ack is sent
ahead of its corresponding disk write(). In my proposed implementation, many acks corresponding
to not-yet-written data can be outstanding.

bq. This optimization you've described should only affect latency, not throughput. The fact
that you're seeing a 15-20% gain here surprises me.

The 15-20% improvement comes from concurrency in the computation. Checksum verification and
disk writes are done by the (newly added) disk-writer thread, while network-related work
(packet receive, packet send and ack send) is done concurrently by the existing network thread.

Prior to the optimization, all of this was done sequentially. With the optimization, we get
better throughput by overlapping the network-related and disk-write-related computation.
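
To make the thread split concrete, here is a rough, hypothetical sketch of the arrangement described above (class and method names are illustrative stand-ins, not the actual DataNode code):

{code}
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch only -- not the actual BlockReceiver code.
class PipelinedBlockReceiver {
  // Minimal stand-in for a 64 KB data packet.
  static class Packet {
    byte[] data;
    boolean lastPacketInBlock;
  }

  private final BlockingQueue<Packet> diskQueue = new ArrayBlockingQueue<>(32);

  // Runs in the existing network thread.
  void networkLoop() throws IOException, InterruptedException {
    while (true) {
      Packet p = receivePacketFromSocket();
      if (!p.lastPacketInBlock) {
        sendAck(p);              // early ack: do not wait for the disk write
      }
      diskQueue.put(p);          // hand off to the disk-writer thread
      if (p.lastPacketInBlock) {
        break;                   // the final ack is sent by the disk-writer thread
      }
    }
  }

  // Runs in the newly added disk-writer thread.
  void diskWriterLoop() throws IOException, InterruptedException {
    while (true) {
      Packet p = diskQueue.take();
      verifyChecksum(p);         // CRC verification moved off the network thread
      writeToDisk(p);
      if (p.lastPacketInBlock) {
        flushToOs();             // final ack only after the whole block reaches the OS
        sendAck(p);
        break;
      }
    }
  }

  // Stubs standing in for the real socket/disk plumbing.
  Packet receivePacketFromSocket() throws IOException { return new Packet(); }
  void sendAck(Packet p) throws IOException {}
  void verifyChecksum(Packet p) throws IOException {}
  void writeToDisk(Packet p) throws IOException {}
  void flushToOs() throws IOException {}
}
{code}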

EXAMPLE:
 If Xi is the time taken for network computation on the i-th packet and
 Yi is the time taken for disk computation (checksum verification + disk write) on the i-th packet:

 Total time taken in the current code:   z1 = Sum(Xi + Yi)   (over all i)
 Total time taken in the optimised code: z2 = Maximum( Sum(Xi), Sum(Yi) )

z2 will be noticeably less than z1 whenever Xi and Yi are of comparable cost.

For a 128 MB block with 64 KB packets, there are 2048 packets.
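
To put hypothetical numbers on this (purely illustrative, not measured values): suppose Xi = 0.2 ms and Yi = 0.3 ms per packet. Then for the 2048 packets of one block:

 z1 = 2048 * (0.2 + 0.3) ms = 1024 ms
 z2 = max(2048 * 0.2, 2048 * 0.3) ms = max(409.6, 614.4) ms ~= 614 ms

i.e. roughly a 40% reduction in the time to absorb the block, with the disk thread kept fully busy and the network thread idle about a third of the time.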

This can be optimised even further by splitting the CRC computation overhead between the network
and disk threads on the fly, as follows:

Before queuing a packet to the disk-write thread, the network thread checks the size of the queue.
If it is above a certain threshold, the network thread itself calls the VerifyChunks function on
the packet to verify the CRC, sets a flag in the packet indicating that CRC verification is
already done, and only then enqueues the packet. This makes Yi smaller and Xi larger (currently,
Yi is much higher than Xi based on my measurements).
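
As a rough illustration of this load balancing (hypothetical names and threshold, not the patch itself):

{code}
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the on-the-fly CRC load balancing; not the actual patch.
class CrcLoadBalancer {
  static class Packet {
    byte[] data;
    volatile boolean checksumVerified;   // set when the CRC has already been checked
  }

  private static final int QUEUE_THRESHOLD = 16;   // illustrative value
  private final BlockingQueue<Packet> diskQueue = new ArrayBlockingQueue<>(32);

  // Called on the network thread for every received packet.
  void enqueue(Packet p) throws IOException, InterruptedException {
    if (diskQueue.size() > QUEUE_THRESHOLD) {
      // The disk thread is falling behind: spend network-thread cycles on the CRC
      // so the disk thread only has to do the write (Yi shrinks, Xi grows).
      verifyChunks(p);
      p.checksumVerified = true;
    }
    diskQueue.put(p);
  }

  // Called on the disk-writer thread.
  void writeNextPacket() throws IOException, InterruptedException {
    Packet p = diskQueue.take();
    if (!p.checksumVerified) {
      verifyChunks(p);
    }
    writeToDisk(p);
  }

  void verifyChunks(Packet p) throws IOException {}   // stands in for the real CRC check
  void writeToDisk(Packet p) throws IOException {}    // stands in for the real write path
}
{code}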

I have incorporated the above load-balancing mechanism in the patch that I will upload shortly.

By doing this I am getting an extra 18% improvement, so from the concurrency changes I am now
getting around a 33-38% improvement overall. All of this is only with replication=1. With higher
replication factors, network bottlenecks come into play and hide the effect of these improvements.


> Performance improvements to DFSClient and DataNode for faster DFS write at replication
factor of 1
> --------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-959
>                 URL: https://issues.apache.org/jira/browse/HDFS-959
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, hdfs client
>    Affects Versions: 0.20.2, 0.22.0
>         Environment: RHEL5 on Dual CPU quad-core Intel servers, 16 GB RAM, 4 SATA disks.
>            Reporter: Naredula Janardhana Reddy
>             Fix For: 0.20.2, 0.22.0
>
>
> The following improvements are suggested to DFSClient and DataNode to improve DFS write
throughput, based on experimental verification with replication factor of 1.
> The changes are useful in principle for replication factors of 2 and 3 as well, but they
do not currently demonstrate noticeable performance improvement in our test-bed because of
a network throughput bottleneck that hides the benefit of these changes. 
> All changes are applicable to 0.20.2. Some of them are applicable to trunk, as noted
below. I have not verified applicability to 0.21.
> List of Improvements
> -----------------------------
> Item 1: DFSClient. Finer-grained locks in WriteChunk(). Currently the lock is held at the chunk level (512 bytes). It can be moved to the packet level (64 KB) to lower the frequency of locking (a rough sketch follows this item).
>  This optimization applies to 20.2. It already appears in trunk.
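
A hypothetical illustration of the lock-granularity change (generic code, not DFSClient's actual implementation): buffer chunks locally and take the shared lock only once per 64 KB packet instead of once per 512-byte chunk.

{code}
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of coarser lock granularity; not DFSClient's actual code.
class PacketLevelLocking {
  private static final int PACKET_SIZE = 64 * 1024;            // 64 KB packet
  private final Queue<byte[]> dataQueue = new ArrayDeque<>();  // shared with the streaming thread

  private byte[] currentPacket = new byte[PACKET_SIZE];
  private int packetOffset = 0;

  // Called once per 512-byte chunk; no shared lock is taken here.
  void writeChunk(byte[] chunk, int off, int len) {
    System.arraycopy(chunk, off, currentPacket, packetOffset, len);
    packetOffset += len;
    if (packetOffset == PACKET_SIZE) {
      enqueuePacket();             // lock only once per full 64 KB packet
    }
  }

  // The only place the shared lock is acquired.
  private void enqueuePacket() {
    synchronized (dataQueue) {
      dataQueue.add(currentPacket);
      dataQueue.notifyAll();       // wake the thread that streams packets to the DataNode
    }
    currentPacket = new byte[PACKET_SIZE];
    packetOffset = 0;
  }
}
{code}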
> Item 2: Misc. improvements to DataNode
>  2.1:  Concurrency of Disk Writes: Checksum verification and writing data to disk can be moved to a separate thread ("Disk Write Thread"). This allows the existing "network thread" to send acks to the DFSClient faster, and also allows the packet to be forwarded to the replication node sooner. In effect, the DataNode can consume packets at higher speed.
>  This optimization applies to 20.2 and trunk.
>  2.2:  Bulk Receive and Bulk Send: This optimization is enabled by 2.1. The DataNode can now receive (and hand off) more than one packet at a time, since there is now a buffer between the (existing) network thread and the (newly added) Disk Write thread (a small batching sketch follows this item).
>  This optimization applies to 20.2 and trunk.
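
A minimal, hypothetical illustration of the batching that this buffer makes possible (illustrative names, not the actual DataNode code):

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of bulk hand-off between the network and disk-write threads.
class BulkPacketConsumer {
  private final BlockingQueue<byte[]> diskQueue = new ArrayBlockingQueue<>(64);

  // Runs in the disk-write thread: block for one packet, then drain whatever
  // else has accumulated so several packets are processed in a single pass.
  void consumeBatch() throws InterruptedException {
    List<byte[]> batch = new ArrayList<>();
    batch.add(diskQueue.take());     // wait until at least one packet is available
    diskQueue.drainTo(batch, 15);    // then grab up to 15 more without blocking
    for (byte[] packet : batch) {
      writeToDisk(packet);
    }
  }

  void writeToDisk(byte[] packet) {} // stands in for the real write path
}
{code}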
>  2.3: Early Ack: The proposed optimization is to send acks to the client as soon as possible instead of waiting for the disk write. Note that the last ack is an exception: it is sent only after the data has been flushed to the OS.
>  This optimization applies to 20.2. It already appears in trunk.
>  2.4: lseek optimization: Currently lseek (the system call) is issued before every disk write, which is not necessary when the write is sequential. The proposed optimization calls lseek only when necessary (see the sketch after this item).
>  This optimization applies to 20.2. I was unable to tell if it is already in trunk.
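
A hedged sketch of the idea (hypothetical helper class, not the patch): track the expected file position and seek only when a write is not sequential.

{code}
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: avoid redundant seeks for sequential writes.
class SeekAvoidingWriter {
  private final RandomAccessFile file;
  private long expectedPos;          // where the next sequential write would land

  SeekAvoidingWriter(RandomAccessFile file) throws IOException {
    this.file = file;
    this.expectedPos = file.getFilePointer();
  }

  // Write 'len' bytes at 'offset', seeking only when the write is not sequential.
  void write(long offset, byte[] buf, int off, int len) throws IOException {
    if (offset != expectedPos) {
      file.seek(offset);             // only pay for lseek on a non-sequential write
    }
    file.write(buf, off, len);
    expectedPos = offset + len;
  }
}
{code}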
>  2.5: Checksum buffered writes: Currently the checksum is written through a buffered stream of size 512 bytes. This can be increased to a larger size - such as 4 KB - to lower the number of write() system calls and save context-switch overhead (see the sketch after this item).
>  This optimization applies to 20.2. I was unable to tell if it is already in trunk.
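
A hedged illustration of the change (the stream wiring is hypothetical; only the buffer sizes come from the item above):

{code}
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: enlarge the buffer in front of the checksum (.meta) stream.
class ChecksumStreamSetup {
  static OutputStream openChecksumStream(String metaFilePath) throws IOException {
    // Before: a 512-byte buffer means roughly one write() system call per 512 bytes.
    // After: a 4 KB buffer batches several checksum writes into one system call.
    return new BufferedOutputStream(new FileOutputStream(metaFilePath), 4 * 1024);
  }
}
{code}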
> Item 3: Applying HADOOP-6166 - PureJavaCrc32() - from trunk to 20.2
>  This is applicable to 20.2.  It already appears in trunk.
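
For context, org.apache.hadoop.util.PureJavaCrc32 implements java.util.zip.Checksum, so it can stand in wherever the Checksum interface is used; a minimal usage sketch (illustrative only):

{code}
import java.util.zip.Checksum;
import org.apache.hadoop.util.PureJavaCrc32;

// Illustrative only: PureJavaCrc32 plugs in behind the standard Checksum interface.
class Crc32Example {
  static long checksumOf(byte[] data) {
    Checksum crc = new PureJavaCrc32();  // pure-Java CRC32 from Hadoop's util package
    crc.update(data, 0, data.length);
    return crc.getValue();
  }
}
{code}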
> Performance Experiments Results
> -----------------------------------------------
> Performance experiments showed the following numbers:
> Hadoop Version: 0.20.2
> Server Configs: RHEL5, Quad-core dual-CPU, 16GB RAM, 4 SATA disks
>  $ uname -a
>  Linux gsbl90324.blue.ygrid.yahoo.com 2.6.18-53.1.13.el5 #1 SMP Mon Feb 11 13:27:27 EST
2008 x86_64 x86_64 x86_64 GNU/Linux
>  $ cat /proc/cpuinfo
>  model name	: Intel(R) Xeon(R) CPU           L5420  @ 2.50GHz
>  $ cat /etc/issue
>  Red Hat Enterprise Linux Server release 5.1 (Tikanga)
>  Kernel \r on an \m
> Benchmark Details
> --------------------------
> Benchmark Name: DFSIO
> Benchmark Configuration:
>  a) # maps (writers to DFS per node). Tried the following values: 1,2,3
>  b) # of nodes: Single-node test and 15-node cluster test
> Results Summary
> --------------------------
> a) With all the above optimizations turned on
> All these tests were done with a replication factor of 1. Tests with replication factors of 2 and 3 showed no noticeable improvement, because the gains are hidden by the network bandwidth bottleneck, as noted above.
> What was measured: Write throughput per client (in MB/s)
> | Test Description                                | Baseline (MB/s) | With improvements (MB/s) | % improvement |
> | 15-node cluster with 1 map (writer) per node    | 103             | 147                      | ~43%          |
> | Single-node test with 1 map (writer) per node   | 102             | 148                      | ~45%          |
> | Single-node test with 2 maps (writers) per node |  86             | 101                      | ~16%          |
> | Single-node test with 3 maps (writers) per node |  67             |  76                      | ~13%          |
>     
> b) With the above optimizations turned on individually
> I ran some experiments adding and removing items individually to understand the approximate performance contribution of each item. The numbers below are approximate.
> | ITEM     | Title                                          | Improvement in 0.20 | Improvement in trunk |
> | Item 1   | DFSClient: finer-grained locks in WriteChunk() | 30%                 | Already in trunk     |
> | Item 2.1 | Concurrency of Disk Writes                     | 25%                 | 15-20%               |
> | Item 2.2 | Bulk Receive and Bulk Send                     |  2%                 | (Have not yet tried) |
> | Item 2.3 | Early Ack                                      |  2%                 | Already in trunk     |
> | Item 2.4 | lseek optimization                             |  2%                 | (Have not yet tried) |
> | Item 2.5 | Checksum buffered writes                       |  2%                 | (Have not yet tried) |
> | Item 3   | Applying HADOOP-6166 - PureJavaCrc32()         | 15%                 | Already in trunk     |
> Patches
> -----------
> I will submit a patch for 0.20.2 shortly (in a day).
> I expect to submit a patch for trunk after review comments on the above patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

