From: Brian Bockelman
To: core-user@hadoop.apache.org
Subject: Re: Interesting Hadoop/FUSE-DFS access patterns
Date: Tue, 14 Apr 2009 06:52:55 -0500

Hey Jason,

Thanks, I'll keep this on hand as I do more tests.  I now have a C,
Java, and Python version of my testing program ;)

However, I particularly *like* the fact that there's caching going on
- it'll help out our application immensely, I think.  I'll be looking
at the performance both with and without the cache.

Brian
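(A minimal sketch of the kind of timing loop being discussed in this
thread - not the actual test program; the file path, 4 KB read size,
and iteration count are illustrative assumptions.  It times random
pread() calls against a file, which is the "seek + read" case; timing
plain read() from the current offset separates out the seek cost.)

/* Sketch: time N random 4 KB reads from a file, report microseconds per read.
 * Illustrative only - path, read size, and count are made up. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/hdfs/some/large/file";
    const size_t readsize = 4096;   /* bytes per read */
    const int iterations = 1000;
    char buf[4096];
    int i;

    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror(path); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    if (st.st_size <= (off_t)readsize) {
        fprintf(stderr, "%s: file too small for this test\n", path);
        return 1;
    }

    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (i = 0; i < iterations; i++) {
        /* pread() is a seek + read in one call at a random offset. */
        off_t offset = (off_t)(drand48() * (double)(st.st_size - readsize));
        if (pread(fd, buf, readsize, offset) < 0) { perror("pread"); return 1; }
    }
    gettimeofday(&end, NULL);

    double usec = (end.tv_sec - start.tv_sec) * 1e6
                + (end.tv_usec - start.tv_usec);
    printf("%.1f us per %zu-byte random read\n", usec / iterations, readsize);
    close(fd);
    return 0;
}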
On Apr 14, 2009, at 12:01 AM, jason hadoop wrote:

> The following very simple program will tell the VM to drop the pages
> being cached for a file.  I tend to spin this in a for loop when
> making large tar files, or otherwise working with large files, and
> the system performance really smooths out.
> Since it uses open(path) it will churn through the inode cache and
> directories.
> Something like this might actually significantly speed up HDFS by
> running over the blocks on the datanodes, for busy clusters.
>
> #define _XOPEN_SOURCE 600
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> /** Simple program to dump buffered data for specific files from the
>  * buffer cache.
>  * Copyright Jason Venner 2009, License GPL */
> int main( int argc, char** argv )
> {
>     int failCount = 0;
>     int i;
>     for( i = 1; i < argc; i++ ) {
>         char* file = argv[i];
>         int fd = open( file, O_RDONLY|O_LARGEFILE );
>         if (fd == -1) {
>             perror( file );
>             failCount++;
>             continue;
>         }
>         /* Ask the kernel to drop any cached pages for this file. */
>         int rc = posix_fadvise( fd, 0, 0, POSIX_FADV_DONTNEED );
>         if (rc != 0) {
>             fprintf( stderr, "Failed to flush cache for %s: %s\n",
>                      file, strerror( rc ) );
>             failCount++;
>         }
>         close(fd);
>     }
>     exit( failCount );
> }
>
>
> On Mon, Apr 13, 2009 at 4:01 PM, Scott Carey wrote:
>
>>
>> On 4/12/09 9:41 PM, "Brian Bockelman" wrote:
>>
>>> Ok, here's something perhaps even more strange.  I removed the "seek"
>>> part out of my timings, so I was only timing the "read" instead of
>>> the "seek + read" as in the first case.  I also turned the read-ahead
>>> down to 1 byte (i.e., off).
>>>
>>> The jump *always* occurs at 128KB, exactly.
>>
>> Some random ideas:
>>
>> I have no idea how FUSE interoperates with the Linux block layer, but
>> 128K happens to be the default readahead value for block devices,
>> which may just be a coincidence.
>>
>> For a disk 'sda', you check and set the value (in 512-byte sectors) with:
>>
>> /sbin/blockdev --getra /dev/sda
>> /sbin/blockdev --setra [num blocks] /dev/sda
>>
>> I know that in my file system tests, the OS readahead is not activated
>> until a series of sequential reads goes through the block device, so
>> truly random access is not affected by it.  I've set it to 128MB and
>> random iops does not change on an ext3 or xfs file system.  If this
>> applies to FUSE too, there may be reasons that this behavior differs.
>> Furthermore, even if it did, one would not expect randomly reading 4k
>> to be slower than randomly reading up to the readahead size itself.
>>
>> I also have no idea how much of the OS device queue and block device
>> scheduler is involved with FUSE.  If those are involved, then there's
>> a bunch of stuff to tinker with there as well.
>>
>> Lastly, an FYI if you don't already know the following.  If the OS is
>> caching pages, there is a way to flush these in Linux to evict the
>> cache.  See /proc/sys/vm/drop_caches .
>>
>>> I'm a bit befuddled.  I know we say that HDFS is optimized for large,
>>> sequential reads, not random reads - but it seems that it's one bug-
>>> fix away from being a good general-purpose system.  Heck if I can
>>> find what's causing the issues though...
>>>
>>> Brian
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
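(Following up on the /proc/sys/vm/drop_caches pointer in the quoted
message above: a minimal sketch of evicting the caches system-wide
from C, as a whole-system counterpart to the per-file posix_fadvise
program.  This is illustrative only, not from the thread; it must run
as root, and only clean pages can be dropped, hence the sync() first.)

/* Sketch: drop the Linux page cache (and dentry/inode caches) system-wide
 * by writing to /proc/sys/vm/drop_caches.  Requires root.  Illustrative only. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Flush dirty pages first so they become clean and reclaimable. */
    sync();

    FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
    if (f == NULL) {
        perror("/proc/sys/vm/drop_caches (are you root?)");
        return 1;
    }
    /* "1" = page cache only, "2" = dentries and inodes, "3" = both. */
    if (fputs("3\n", f) == EOF) {
        perror("write");
        fclose(f);
        return 1;
    }
    fclose(f);
    return 0;
}

(The equivalent one-liner from a root shell is
"sync; echo 3 > /proc/sys/vm/drop_caches".)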