From: Brian Bockelman
To: core-user@hadoop.apache.org
Subject: Re: Interesting Hadoop/FUSE-DFS access patterns
Date: Thu, 16 Apr 2009 06:53:30 -0500

Hey Tom,

Yup, that's one of the things I've been looking at -- however, it doesn't
appear to be the likely culprit as to why data access is fairly random.
The time the operation took does not seem to be a factor of the number of
bytes read, at least in the smaller range.

Brian

On Apr 16, 2009, at 5:17 AM, Tom White wrote:

> Not sure if it will affect your findings, but when you read from an
> FSDataInputStream you should check how many bytes were actually read by
> inspecting the return value, and re-read if it was fewer than you wanted.
> See Hadoop's IOUtils readFully() method.
>
> Tom
>
> On Mon, Apr 13, 2009 at 4:22 PM, Brian Bockelman wrote:
>>
>> Hey Todd,
>>
>> Been playing more this morning after thinking about it for the night --
>> I think the culprit is not the network, but actually the cache.  Here's
>> the output of your script, adjusted to do the same calls as I was doing
>> (you had left out the random I/O part).
>>
>> [brian@red tmp]$ java hdfs_tester
>> Mean value for reads of size 0: 0.0447
>> Mean value for reads of size 16384: 10.4872
>> Mean value for reads of size 32768: 10.82925
>> Mean value for reads of size 49152: 6.2417
>> Mean value for reads of size 65536: 7.0511003
>> Mean value for reads of size 81920: 9.411599
>> Mean value for reads of size 98304: 9.378799
>> Mean value for reads of size 114688: 8.99065
>> Mean value for reads of size 131072: 5.1378503
>> Mean value for reads of size 147456: 6.1324
>> Mean value for reads of size 163840: 17.1187
>> Mean value for reads of size 180224: 6.5492
>> Mean value for reads of size 196608: 8.45695
>> Mean value for reads of size 212992: 7.4292
>> Mean value for reads of size 229376: 10.7843
>> Mean value for reads of size 245760: 9.29095
>> Mean value for reads of size 262144: 6.57865
>>
>> Copy of the script below.
>>
>> So, without the FUSE layer, we don't see much (if any) pattern here.
>> The overhead of randomly skipping through the file is higher than the
>> overhead of reading out the data.
>>
>> Upon further inspection, the biggest factor affecting the FUSE layer is
>> actually the Linux VFS caching -- if you notice, the bandwidth in the
>> given graph for larger read sizes is *higher* than 1 Gbps, which is the
>> limit of the network on that particular node.  If I go in the opposite
>> direction -- starting with the largest reads first, then going down to
>> the smallest reads -- the graph entirely smooths out for the small
>> values: everything is read from the filesystem cache in the client RAM.
>> Graph attached.
>>
>> So, on the upside, mounting through FUSE gives us the opportunity to
>> speed up reads for very complex, non-sequential patterns -- for free,
>> thanks to the hardworking Linux kernel.  On the downside, it's
>> incredibly difficult to come up with simple cases to demonstrate
>> performance for an application -- the cache performance and size depend
>> on how much activity there is on the client, the previous filesystem
>> activity the application did, and the amount of concurrent activity on
>> the server.  I can give you results for performance, but it's not going
>> to be the performance you see in real life.  (Gee, if only file systems
>> were easy...)
>>
>> Ok, sorry for the list noise -- it seems I'm going to have to think more
>> about this problem before I can come up with something coherent.
>>
>> Brian
>>
>>
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.FileStatus;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.fs.FSDataInputStream;
>> import org.apache.hadoop.conf.Configuration;
>> import java.io.IOException;
>> import java.net.URI;
>> import java.util.Random;
>>
>> public class hdfs_tester {
>>   public static void main(String[] args) throws Exception {
>>     URI uri = new URI("hdfs://hadoop-name:9000/");
>>     FileSystem fs = FileSystem.get(uri, new Configuration());
>>     Path path = new Path("/user/uscms01/pnfs/unl.edu/data4/cms/store/phedex_monarctest/Nebraska/LoadTest07_Nebraska_33");
>>     FSDataInputStream dis = fs.open(path);
>>     Random rand = new Random();
>>     FileStatus status = fs.getFileStatus(path);
>>     long file_len = status.getLen();
>>     int iters = 20;
>>     // Sweep read sizes from 0 to 1 MB in 16 KB (4*4096 byte) steps,
>>     // timing 20 random positioned reads at each size.
>>     for (int size = 0; size < 1024*1024; size += 4*4096) {
>>       long csum = 0;
>>       for (int i = 0; i < iters; i++) {
>>         // Pick a random 8-byte-aligned offset that leaves room for the read.
>>         int pos = rand.nextInt((int)((file_len-size-1)/8))*8;
>>         byte buf[] = new byte[size];
>>         if (pos < 0)
>>           pos = 0;
>>         long st = System.nanoTime();
>>         // Positioned read; the return value (bytes actually read) is ignored.
>>         dis.read(pos, buf, 0, size);
>>         long et = System.nanoTime();
>>         csum += et-st;
>>         //System.out.println(String.valueOf(size) + "\t" + String.valueOf(pos) + "\t" + String.valueOf(et - st));
>>       }
>>       float csum2 = csum; csum2 /= iters;
>>       // Mean time per read, in milliseconds.
>>       System.out.println("Mean value for reads of size " + size + ": " + (csum2/1000/1000));
>>     }
>>     fs.close();
>>   }
>> }
>>
>>
>> On Apr 13, 2009, at 3:14 AM, Todd Lipcon wrote:
>>
>>> On Mon, Apr 13, 2009 at 1:07 AM, Todd Lipcon wrote:
>>>
>>>> Hey Brian,
>>>>
>>>> This is really interesting stuff.  I'm curious - have you tried these
>>>> same experiments using the Java API?  I'm wondering whether this is
>>>> FUSE-specific or inherent to all HDFS reads.  I'll try to reproduce
>>>> this over here as well.
>>>>
>>>> This smells sort of Nagle-related to me... if you get a chance, you
>>>> may want to edit DFSClient.java and change TCP_WINDOW_SIZE to
>>>> 256 * 1024, and see if the magic number jumps up to 256KB.  If so, I
>>>> think it should be a pretty easy bugfix.
>>>
>>> Oops - spoke too fast there... it looks like TCP_WINDOW_SIZE isn't
>>> actually used for any socket configuration, so I don't think that will
>>> make a difference... still think networking might be the culprit,
>>> though.
>>>
>>> -Todd
>>>
>>>> On Sun, Apr 12, 2009 at 9:41 PM, Brian Bockelman wrote:
>>>>
>>>>> Ok, here's something perhaps even more strange.  I removed the "seek"
>>>>> part from my timings, so I was only timing the "read" instead of the
>>>>> "seek + read" as in the first case.  I also turned the read-ahead
>>>>> down to 1 byte (aka, off).
>>>>>
>>>>> The jump *always* occurs at exactly 128KB.
>>>>>
>>>>> I'm a bit befuddled.  I know we say that HDFS is optimized for large,
>>>>> sequential reads, not random reads - but it seems that it's one
>>>>> bug-fix away from being a good general-purpose system.  Heck if I can
>>>>> find what's causing the issues, though...
>>>>>
>>>>> Brian
>>>>>
>>>>> On Apr 12, 2009, at 8:53 PM, Brian Bockelman wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> I was doing some research on I/O patterns of our applications, and I
>>>>>> noticed the attached pattern.
>>>>>> In case the mail server strips out attachments, I also uploaded it:
>>>>>>
>>>>>> http://t2.unl.edu/store/Hadoop_64KB_ra.png
>>>>>> http://t2.unl.edu/store/Hadoop_1024KB_ra.png
>>>>>>
>>>>>> This was taken using the FUSE mounts of Hadoop; the first one was
>>>>>> with a 64KB read-ahead and the second with a 1MB read-ahead.  The
>>>>>> test 'seek'ed randomly within a 2GB file and was performed 20 times
>>>>>> for each read size, advancing in 4KB increments.  Each blue dot is
>>>>>> the read time of one experiment; the red dot is the median read time
>>>>>> for that read size.  The graphs show the absolute read time.
>>>>>>
>>>>>> There's very interesting behavior - it seems that there is a change
>>>>>> in behavior around reads of size 800KB.  The read times go down
>>>>>> significantly when you do *larger* reads.  I thought this was just
>>>>>> an artifact of the 64KB read-ahead I set in FUSE, so I upped the
>>>>>> read-ahead significantly, to 1MB.  In this case, the difference
>>>>>> between the small read sizes and the large read sizes is *very*
>>>>>> pronounced.  If it were an artifact from FUSE, I'd expect the place
>>>>>> where the change occurs to be a function of the read-ahead size.
>>>>>>
>>>>>> Does anyone out there who knows the code have any ideas?  What could
>>>>>> I be doing wrong?
>>>>>>
>>>>>> Brian
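
Tom's suggestion near the top of the thread -- check the return value of
read() and re-read, or use readFully() -- matters for any benchmark like the
script above, since a positioned read() may legally return fewer bytes than
requested.  Below is a minimal sketch of both approaches; the cluster URI,
file path, and class name are placeholders, not anything taken from the
thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.net.URI;

public class ReadFullyExample {
  public static void main(String[] args) throws Exception {
    // Placeholder cluster URI and file path -- substitute your own.
    FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000/"), new Configuration());
    FSDataInputStream in = fs.open(new Path("/path/to/some/file"));
    byte[] buf = new byte[128 * 1024];

    // Option 1: loop on the return value of the positioned read; a single
    // read() call may return fewer bytes than requested.
    int off = 0;
    while (off < buf.length) {
      int n = in.read((long) off, buf, off, buf.length - off);
      if (n < 0) break;            // hit end of file
      off += n;
    }

    // Option 2: let Hadoop do the looping for you.
    in.readFully(0L, buf, 0, buf.length);        // positioned readFully
    in.seek(0);
    IOUtils.readFully(in, buf, 0, buf.length);   // stream-oriented readFully

    in.close();
    fs.close();
  }
}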
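
Brian's observation about ordering -- running the largest reads first leaves
the Linux page cache warm, so the subsequent small reads are served from
client RAM -- can be reproduced against the FUSE mount with plain java.io.
This is a rough sketch rather than the code used for the graphs; the mount
point, file path, and class name are placeholders, and it assumes a file
comfortably larger than 1MB.

import java.io.RandomAccessFile;
import java.util.Random;

public class fuse_order_tester {
  public static void main(String[] args) throws Exception {
    // Placeholder path on a FUSE-mounted HDFS filesystem.
    RandomAccessFile f = new RandomAccessFile("/mnt/hadoop/path/to/large/file", "r");
    long fileLen = f.length();
    Random rand = new Random();
    int iters = 20;
    // Walk the read sizes from largest to smallest.  Data pulled in by the
    // big reads stays in the Linux page cache, so the later, small reads
    // come back from RAM and the low end of the curve flattens out.  Run
    // the loop smallest-first to compare against colder-cache numbers.
    for (int size = 1024 * 1024; size >= 4096; size -= 4 * 4096) {
      byte[] buf = new byte[size];
      long total = 0;
      for (int i = 0; i < iters; i++) {
        long pos = (long) (rand.nextDouble() * (fileLen - size));
        long st = System.nanoTime();
        f.seek(pos);
        f.readFully(buf, 0, size);
        long et = System.nanoTime();
        total += et - st;
      }
      System.out.println("Mean ms for reads of size " + size + ": "
          + (total / (double) iters / 1e6));
    }
    f.close();
  }
}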