From: Brian Bockelman
To: core-user@hadoop.apache.org
Subject: Re: Interesting Hadoop/FUSE-DFS access patterns
Date: Mon, 13 Apr 2009 10:22:51 -0500

Hey Todd,

Been playing more this morning after thinking about it for the night -- I think the culprit is not the network, but actually the cache. Here's the output of your script, adjusted to do the same calls I was doing (you had left out the random I/O part); the means are in milliseconds:

[brian@red tmp]$ java hdfs_tester
Mean value for reads of size 0: 0.0447
Mean value for reads of size 16384: 10.4872
Mean value for reads of size 32768: 10.82925
Mean value for reads of size 49152: 6.2417
Mean value for reads of size 65536: 7.0511003
Mean value for reads of size 81920: 9.411599
Mean value for reads of size 98304: 9.378799
Mean value for reads of size 114688: 8.99065
Mean value for reads of size 131072: 5.1378503
Mean value for reads of size 147456: 6.1324
Mean value for reads of size 163840: 17.1187
Mean value for reads of size 180224: 6.5492
Mean value for reads of size 196608: 8.45695
Mean value for reads of size 212992: 7.4292
Mean value for reads of size 229376: 10.7843
Mean value for reads of size 245760: 9.29095
Mean value for reads of size 262144: 6.57865

A copy of the script is below. So, without the FUSE layer, we don't see much (if any) pattern here.
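(Each of those numbers is the mean over 20 timed positional reads -- the FSDataInputStream call that folds the seek and the read into a single operation. A minimal sketch of that call next to the explicit seek-then-read form; the file path, offsets, and read size here are just placeholders:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class pread_sketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://hadoop-name:9000/"), new Configuration());
        // Placeholder: any sufficiently large file in HDFS will do.
        FSDataInputStream dis = fs.open(new Path("/path/to/some/large/file"));
        byte[] buf = new byte[128 * 1024];

        // Positional read: the offset is an argument of the call and the
        // stream's own file pointer is left untouched.  (It may return
        // fewer bytes than requested; good enough for a timing sketch.)
        long st = System.nanoTime();
        dis.read(1024L * 1024L, buf, 0, buf.length);
        System.out.println("pread:     " + (System.nanoTime() - st) / 1e6 + " ms");

        // Stateful form: an explicit seek, then a read at the new pointer.
        st = System.nanoTime();
        dis.seek(2048L * 1024L);
        dis.readFully(buf, 0, buf.length);
        System.out.println("seek+read: " + (System.nanoTime() - st) / 1e6 + " ms");

        dis.close();
        fs.close();
    }
}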
The overhead of randomly skipping through the file is higher than the overhead of reading out the data.

Upon further inspection, the biggest factor affecting the FUSE layer is actually Linux VFS caching -- if you notice, the bandwidth in the given graph for the larger read sizes is *higher* than 1Gbps, which is the limit of the network on that particular node. If I go in the opposite direction -- starting with the largest reads first, then going down to the smallest -- the graph smooths out entirely for the small values: everything is read from the filesystem cache in the client's RAM. Graph attached. (A sketch of that largest-first sweep is included below, after the attached script.)

So, on the upside, mounting through FUSE gives us the opportunity to speed up reads for very complex, non-sequential patterns -- for free, thanks to the hardworking Linux kernel. On the downside, it's incredibly difficult to come up with simple cases to demonstrate performance for an application -- the cache hit rate and size depend on how much activity there is on the client, on the filesystem activity the application has already done, and on the amount of concurrent activity on the server. I can give you results for performance, but it's not going to be the performance you see in real life. (Gee, if only file systems were easy...)

Ok, sorry for the list noise -- it seems I'm going to have to think more about this problem before I can come up with something coherent.

Brian

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.conf.Configuration;
import java.io.IOException;
import java.net.URI;
import java.util.Random;

public class hdfs_tester {
    public static void main(String[] args) throws Exception {
        URI uri = new URI("hdfs://hadoop-name:9000/");
        FileSystem fs = FileSystem.get(uri, new Configuration());
        Path path = new Path("/user/uscms01/pnfs/unl.edu/data4/cms/store/phedex_monarctest/Nebraska/LoadTest07_Nebraska_33");
        FSDataInputStream dis = fs.open(path);
        Random rand = new Random();
        FileStatus status = fs.getFileStatus(path);
        long file_len = status.getLen();
        int iters = 20;
        // Sweep read sizes from 0 up to 1MB in 16KB (4*4096) increments.
        for (int size = 0; size < 1024*1024; size += 4*4096) {
            long csum = 0;
            for (int i = 0; i < iters; i++) {
                // Pick a random 8-byte-aligned offset that leaves room for a full read.
                int pos = rand.nextInt((int)((file_len - size - 1) / 8)) * 8;
                byte buf[] = new byte[size];
                if (pos < 0)
                    pos = 0;
                long st = System.nanoTime();
                dis.read(pos, buf, 0, size);  // positional read: no separate seek
                long et = System.nanoTime();
                csum += et - st;
                //System.out.println(String.valueOf(size) + "\t" + String.valueOf(pos) + "\t" + String.valueOf(et - st));
            }
            float csum2 = csum;
            csum2 /= iters;
            // Report the mean read time in milliseconds.
            System.out.println("Mean value for reads of size " + size + ": " + (csum2/1000/1000));
        }
        fs.close();
    }
}
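(The largest-first sweep mentioned above went through the FUSE mount rather than the Java API; a minimal sketch of that variant, using plain java.io -- the mount point and file name are placeholders, and on the way down the small reads mostly come back from the client's page cache:)

import java.io.RandomAccessFile;
import java.util.Random;

public class fuse_reverse_sweep {
    public static void main(String[] args) throws Exception {
        // Placeholder path on the fuse-dfs mount point; substitute a real, large file.
        RandomAccessFile raf = new RandomAccessFile("/mnt/hadoop/path/to/large/file", "r");
        long fileLen = raf.length();
        Random rand = new Random();
        int iters = 20;
        // Walk the read sizes from 1MB down to 16KB so the large reads
        // populate the Linux page cache before the small reads run.
        for (int size = 1024 * 1024; size >= 4 * 4096; size -= 4 * 4096) {
            long total = 0;
            byte[] buf = new byte[size];
            for (int i = 0; i < iters; i++) {
                long pos = (long) (rand.nextDouble() * (fileLen - size - 1));
                long st = System.nanoTime();
                raf.seek(pos);
                raf.readFully(buf, 0, size);
                total += System.nanoTime() - st;
            }
            System.out.println("Mean value for reads of size " + size + ": "
                    + (total / (double) iters / 1e6));
        }
        raf.close();
    }
}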
On Apr 13, 2009, at 3:14 AM, Todd Lipcon wrote:

> On Mon, Apr 13, 2009 at 1:07 AM, Todd Lipcon wrote:
>
>> Hey Brian,
>>
>> This is really interesting stuff. I'm curious - have you tried these
>> same experiments using the Java API? I'm wondering whether this is
>> FUSE-specific or inherent to all HDFS reads. I'll try to reproduce
>> this over here as well.
>>
>> This smells sort of Nagle-related to me... if you get a chance, you
>> may want to edit DFSClient.java and change TCP_WINDOW_SIZE to 256 *
>> 1024, and see if the magic number jumps up to 256KB. If so, I think
>> it should be a pretty easy bugfix.
>
> Oops - spoke too fast there... looks like TCP_WINDOW_SIZE isn't
> actually used for any socket configuration, so I don't think that will
> make a difference... still think networking might be the culprit,
> though.
>
> -Todd
>
>> On Sun, Apr 12, 2009 at 9:41 PM, Brian Bockelman wrote:
>>
>>> Ok, here's something perhaps even more strange. I removed the "seek"
>>> part from my timings, so I was only timing the "read" instead of the
>>> "seek + read" as in the first case. I also turned the read-ahead down
>>> to 1 byte (aka, off).
>>>
>>> The jump *always* occurs at 128KB, exactly.
>>>
>>> I'm a bit befuddled. I know we say that HDFS is optimized for large,
>>> sequential reads, not random reads - but it seems that it's one
>>> bug-fix away from being a good general-purpose system. Heck if I can
>>> find what's causing the issues, though...
>>>
>>> Brian
>>>
>>> On Apr 12, 2009, at 8:53 PM, Brian Bockelman wrote:
>>>
>>>> Hey all,
>>>>
>>>> I was doing some research on the I/O patterns of our applications,
>>>> and I noticed the attached pattern. In case the mail server strips
>>>> out attachments, I also uploaded it:
>>>>
>>>> http://t2.unl.edu/store/Hadoop_64KB_ra.png
>>>> http://t2.unl.edu/store/Hadoop_1024KB_ra.png
>>>>
>>>> This was taken using the FUSE mounts of Hadoop; the first one was
>>>> with a 64KB read-ahead and the second with a 1MB read-ahead. This
>>>> was taken from a 2GB file, randomly 'seek'ing within the file. This
>>>> was performed 20 times for each read size, advancing in 4KB
>>>> increments. Each blue dot is the read time of one experiment; the
>>>> red dot is the median read time for that read size. The graphs show
>>>> the absolute read time.
>>>>
>>>> There's very interesting behavior - it seems that there is a change
>>>> in behavior around reads of size 800KB. The time for the reads goes
>>>> down significantly when you read *larger* sizes. I thought this was
>>>> just an artifact of the 64KB read-ahead I set in FUSE, so I upped
>>>> the read-ahead significantly, to 1MB. In this case, the difference
>>>> between the small read sizes and the large read sizes is *very*
>>>> pronounced. If it were an artifact of FUSE, I'd expect the place
>>>> where the change occurs to be a function of the read-ahead size.
>>>>
>>>> Does anyone out there who knows the code have any ideas? What could
>>>> I be doing wrong?
>>>>
>>>> Brian
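(For completeness, a narrower version of the Java-API test above that looks only at read sizes straddling the 128KB boundary mentioned in the quoted thread -- a sketch, reusing the same placeholder namenode URI and an illustrative file path:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;
import java.util.Random;

public class boundary_probe {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://hadoop-name:9000/"), new Configuration());
        // Placeholder: any large file in HDFS will do.
        Path path = new Path("/path/to/some/large/file");
        FSDataInputStream dis = fs.open(path);
        long fileLen = fs.getFileStatus(path).getLen();
        Random rand = new Random();
        int iters = 20;
        // Read sizes just below, at, and above the 128KB boundary.
        int[] sizes = {96 * 1024, 120 * 1024, 128 * 1024, 136 * 1024, 160 * 1024};
        for (int size : sizes) {
            long total = 0;
            byte[] buf = new byte[size];
            for (int i = 0; i < iters; i++) {
                long pos = (long) (rand.nextDouble() * (fileLen - size - 1));
                long st = System.nanoTime();
                dis.read(pos, buf, 0, size);  // positional read, offset in the call
                total += System.nanoTime() - st;
            }
            System.out.println("Mean value for reads of size " + size + ": "
                    + (total / (double) iters / 1e6));
        }
        dis.close();
        fs.close();
    }
}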