From: Brian Bockelman
To: core-user@hadoop.apache.org
Subject: Re: Interesting Hadoop/FUSE-DFS access patterns
Date: Tue, 14 Apr 2009 06:52:55 -0500

Hey Jason,

Thanks, I'll keep this on hand as I do more tests.  I now have a C,
Java, and Python version of my testing program ;)

However, I particularly *like* the fact that there's caching going on
- it'll help out our application immensely, I think.  I'll be looking
at the performance both with and without the cache.

Brian
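(A minimal sketch of the kind of timing loop being discussed in this
thread - not the actual test program; the file path, 4 KB read size,
and iteration count are illustrative assumptions.  It times random
pread() calls against a file, which is the "seek + read" case; timing
plain read() from the current offset separates out the seek cost.)

/* Sketch: time N random 4 KB reads from a file, report microseconds per read.
 * Illustrative only - path, read size, and count are made up. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/hdfs/some/large/file";
    const size_t readsize = 4096;   /* bytes per read */
    const int iterations = 1000;
    char buf[4096];
    int i;

    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror(path); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    if (st.st_size <= (off_t)readsize) {
        fprintf(stderr, "%s: file too small for this test\n", path);
        return 1;
    }

    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (i = 0; i < iterations; i++) {
        /* pread() is a seek + read in one call at a random offset. */
        off_t offset = (off_t)(drand48() * (double)(st.st_size - readsize));
        if (pread(fd, buf, readsize, offset) < 0) { perror("pread"); return 1; }
    }
    gettimeofday(&end, NULL);

    double usec = (end.tv_sec - start.tv_sec) * 1e6
                + (end.tv_usec - start.tv_usec);
    printf("%.1f us per %zu-byte random read\n", usec / iterations, readsize);
    close(fd);
    return 0;
}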
On Apr 14, 2009, at 12:01 AM, jason hadoop wrote:

> The following very simple program will tell the VM to drop the pages
> being cached for a file.  I tend to spin this in a for loop when
> making large tar files, or otherwise working with large files, and
> the system performance really smooths out.
> Since it uses open(path) it will churn through the inode cache and
> directories.
> Something like this might actually significantly speed up HDFS by
> running over the blocks on the datanodes, for busy clusters.
>
> #define _XOPEN_SOURCE 600
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> /** Simple program to dump buffered data for specific files from the
>  * buffer cache.
>  * Copyright Jason Venner 2009, License GPL */
> int main( int argc, char** argv )
> {
>     int failCount = 0;
>     int i;
>     for( i = 1; i < argc; i++ ) {
>         char* file = argv[i];
>         int fd = open( file, O_RDONLY|O_LARGEFILE );
>         if (fd == -1) {
>             perror( file );
>             failCount++;
>             continue;
>         }
>         /* Ask the kernel to drop any cached pages for this file. */
>         int rc = posix_fadvise( fd, 0, 0, POSIX_FADV_DONTNEED );
>         if (rc != 0) {
>             fprintf( stderr, "Failed to flush cache for %s: %s\n",
>                      file, strerror( rc ) );
>             failCount++;
>         }
>         close(fd);
>     }
>     exit( failCount );
> }
>
>
> On Mon, Apr 13, 2009 at 4:01 PM, Scott Carey wrote:
>
>>
>> On 4/12/09 9:41 PM, "Brian Bockelman" wrote:
>>
>>> Ok, here's something perhaps even more strange.  I removed the "seek"
>>> part out of my timings, so I was only timing the "read" instead of
>>> the "seek + read" as in the first case.  I also turned the read-ahead
>>> down to 1 byte (i.e., off).
>>>
>>> The jump *always* occurs at 128KB, exactly.
>>
>> Some random ideas:
>>
>> I have no idea how FUSE interoperates with the Linux block layer, but
>> 128K happens to be the default readahead value for block devices,
>> which may just be a coincidence.
>>
>> For a disk 'sda', you check and set the value (in 512-byte sectors) with:
>>
>> /sbin/blockdev --getra /dev/sda
>> /sbin/blockdev --setra [num blocks] /dev/sda
>>
>> I know that in my file system tests, the OS readahead is not activated
>> until a series of sequential reads goes through the block device, so
>> truly random access is not affected by it.  I've set it to 128MB and
>> random iops does not change on an ext3 or xfs file system.  If this
>> applies to FUSE too, there may be reasons that this behavior differs.
>> Furthermore, even if it did, one would not expect randomly reading 4k
>> to be slower than randomly reading up to the readahead size itself.
>>
>> I also have no idea how much of the OS device queue and block device
>> scheduler is involved with FUSE.  If those are involved, then there's
>> a bunch of stuff to tinker with there as well.
>>
>> Lastly, an FYI if you don't already know the following.  If the OS is
>> caching pages, there is a way to flush these in Linux to evict the
>> cache.  See /proc/sys/vm/drop_caches .
>>
>>> I'm a bit befuddled.  I know we say that HDFS is optimized for large,
>>> sequential reads, not random reads - but it seems that it's one bug-
>>> fix away from being a good general-purpose system.  Heck if I can
>>> find what's causing the issues though...
>>>
>>> Brian
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
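(Following up on the /proc/sys/vm/drop_caches pointer in the quoted
message above: a minimal sketch of evicting the caches system-wide
from C, as a whole-system counterpart to the per-file posix_fadvise
program.  This is illustrative only, not from the thread; it must run
as root, and only clean pages can be dropped, hence the sync() first.)

/* Sketch: drop the Linux page cache (and dentry/inode caches) system-wide
 * by writing to /proc/sys/vm/drop_caches.  Requires root.  Illustrative only. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Flush dirty pages first so they become clean and reclaimable. */
    sync();

    FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
    if (f == NULL) {
        perror("/proc/sys/vm/drop_caches (are you root?)");
        return 1;
    }
    /* "1" = page cache only, "2" = dentries and inodes, "3" = both. */
    if (fputs("3\n", f) == EOF) {
        perror("write");
        fclose(f);
        return 1;
    }
    fclose(f);
    return 0;
}

(The equivalent one-liner from a root shell is
"sync; echo 3 > /proc/sys/vm/drop_caches".)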