From: Brian Bockelman
To: core-user@hadoop.apache.org
Subject: Re: Interesting Hadoop/FUSE-DFS access patterns
Date: Thu, 16 Apr 2009 06:53:30 -0500

Hey Tom,

Yup, that's one of the things I've been looking at -- however, it doesn't
appear to be the likely culprit as to why data access is fairly random.
The time the operation took does not seem to be a factor of the number of
bytes read, at least in the smaller range.

Brian

On Apr 16, 2009, at 5:17 AM, Tom White wrote:

> Not sure if it will affect your findings, but when you read from an
> FSDataInputStream you should check how many bytes were actually read by
> inspecting the return value, and re-read if it was fewer than you wanted.
> See Hadoop's IOUtils readFully() method.
>
> Tom
>
> On Mon, Apr 13, 2009 at 4:22 PM, Brian Bockelman wrote:
>>
>> Hey Todd,
>>
>> Been playing more this morning after thinking about it for the night --
>> I think the culprit is not the network, but actually the cache.  Here's
>> the output of your script, adjusted to do the same calls as I was doing
>> (you had left out the random I/O part).
>>
>> [brian@red tmp]$ java hdfs_tester
>> Mean value for reads of size 0: 0.0447
>> Mean value for reads of size 16384: 10.4872
>> Mean value for reads of size 32768: 10.82925
>> Mean value for reads of size 49152: 6.2417
>> Mean value for reads of size 65536: 7.0511003
>> Mean value for reads of size 81920: 9.411599
>> Mean value for reads of size 98304: 9.378799
>> Mean value for reads of size 114688: 8.99065
>> Mean value for reads of size 131072: 5.1378503
>> Mean value for reads of size 147456: 6.1324
>> Mean value for reads of size 163840: 17.1187
>> Mean value for reads of size 180224: 6.5492
>> Mean value for reads of size 196608: 8.45695
>> Mean value for reads of size 212992: 7.4292
>> Mean value for reads of size 229376: 10.7843
>> Mean value for reads of size 245760: 9.29095
>> Mean value for reads of size 262144: 6.57865
>>
>> Copy of the script below.
>>
>> So, without the FUSE layer, we don't see much (if any) pattern here.
>> The overhead of randomly skipping through the file is higher than the
>> overhead of reading out the data.
>>
>> Upon further inspection, the biggest factor affecting the FUSE layer is
>> actually the Linux VFS caching -- if you notice, the bandwidth in the
>> given graph for larger read sizes is *higher* than 1 Gbps, which is the
>> limit of the network on that particular node.  If I go in the opposite
>> direction -- starting with the largest reads first, then going down to
>> the smallest reads -- the graph entirely smooths out for the small
>> values: everything is read from the filesystem cache in the client RAM.
>> Graph attached.
>>
>> So, on the upside, mounting through FUSE gives us the opportunity to
>> speed up reads for very complex, non-sequential patterns -- for free,
>> thanks to the hardworking Linux kernel.  On the downside, it's
>> incredibly difficult to come up with simple cases to demonstrate
>> performance for an application -- the cache performance and size depend
>> on how much activity there is on the client, the previous filesystem
>> activity the application did, and the amount of concurrent activity on
>> the server.  I can give you results for performance, but it's not going
>> to be the performance you see in real life.  (Gee, if only file systems
>> were easy...)
>>
>> Ok, sorry for the list noise -- it seems I'm going to have to think more
>> about this problem before I can come up with something coherent.
>>
>> Brian
>>
>>
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.FileStatus;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.fs.FSDataInputStream;
>> import org.apache.hadoop.conf.Configuration;
>> import java.io.IOException;
>> import java.net.URI;
>> import java.util.Random;
>>
>> public class hdfs_tester {
>>   public static void main(String[] args) throws Exception {
>>     URI uri = new URI("hdfs://hadoop-name:9000/");
>>     FileSystem fs = FileSystem.get(uri, new Configuration());
>>     Path path = new Path("/user/uscms01/pnfs/unl.edu/data4/cms/store/phedex_monarctest/Nebraska/LoadTest07_Nebraska_33");
>>     FSDataInputStream dis = fs.open(path);
>>     Random rand = new Random();
>>     FileStatus status = fs.getFileStatus(path);
>>     long file_len = status.getLen();
>>     int iters = 20;
>>     // Sweep read sizes from 0 to 1 MB in 16 KB (4*4096 byte) steps,
>>     // timing 20 random positioned reads at each size.
>>     for (int size = 0; size < 1024*1024; size += 4*4096) {
>>       long csum = 0;
>>       for (int i = 0; i < iters; i++) {
>>         // Pick a random 8-byte-aligned offset that leaves room for the read.
>>         int pos = rand.nextInt((int)((file_len-size-1)/8))*8;
>>         byte buf[] = new byte[size];
>>         if (pos < 0)
>>           pos = 0;
>>         long st = System.nanoTime();
>>         // Positioned read; the return value (bytes actually read) is ignored.
>>         dis.read(pos, buf, 0, size);
>>         long et = System.nanoTime();
>>         csum += et-st;
>>         //System.out.println(String.valueOf(size) + "\t" + String.valueOf(pos) + "\t" + String.valueOf(et - st));
>>       }
>>       float csum2 = csum; csum2 /= iters;
>>       // Mean time per read, in milliseconds.
>>       System.out.println("Mean value for reads of size " + size + ": " + (csum2/1000/1000));
>>     }
>>     fs.close();
>>   }
>> }
>>
>>
>> On Apr 13, 2009, at 3:14 AM, Todd Lipcon wrote:
>>
>>> On Mon, Apr 13, 2009 at 1:07 AM, Todd Lipcon wrote:
>>>
>>>> Hey Brian,
>>>>
>>>> This is really interesting stuff.  I'm curious - have you tried these
>>>> same experiments using the Java API?  I'm wondering whether this is
>>>> FUSE-specific or inherent to all HDFS reads.  I'll try to reproduce
>>>> this over here as well.
>>>>
>>>> This smells sort of Nagle-related to me... if you get a chance, you
>>>> may want to edit DFSClient.java and change TCP_WINDOW_SIZE to
>>>> 256 * 1024, and see if the magic number jumps up to 256KB.  If so, I
>>>> think it should be a pretty easy bugfix.
>>>
>>> Oops - spoke too fast there... it looks like TCP_WINDOW_SIZE isn't
>>> actually used for any socket configuration, so I don't think that will
>>> make a difference... still think networking might be the culprit,
>>> though.
>>>
>>> -Todd
>>>
>>>> On Sun, Apr 12, 2009 at 9:41 PM, Brian Bockelman wrote:
>>>>
>>>>> Ok, here's something perhaps even more strange.  I removed the "seek"
>>>>> part from my timings, so I was only timing the "read" instead of the
>>>>> "seek + read" as in the first case.  I also turned the read-ahead
>>>>> down to 1 byte (aka, off).
>>>>>
>>>>> The jump *always* occurs at exactly 128KB.
>>>>>
>>>>> I'm a bit befuddled.  I know we say that HDFS is optimized for large,
>>>>> sequential reads, not random reads - but it seems that it's one
>>>>> bug-fix away from being a good general-purpose system.  Heck if I can
>>>>> find what's causing the issues, though...
>>>>>
>>>>> Brian
>>>>>
>>>>> On Apr 12, 2009, at 8:53 PM, Brian Bockelman wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> I was doing some research on I/O patterns of our applications, and I
>>>>>> noticed the attached pattern.
>>>>>> In case the mail server strips out attachments, I also uploaded it:
>>>>>>
>>>>>> http://t2.unl.edu/store/Hadoop_64KB_ra.png
>>>>>> http://t2.unl.edu/store/Hadoop_1024KB_ra.png
>>>>>>
>>>>>> This was taken using the FUSE mounts of Hadoop; the first one was
>>>>>> with a 64KB read-ahead and the second with a 1MB read-ahead.  The
>>>>>> test 'seek'ed randomly within a 2GB file and was performed 20 times
>>>>>> for each read size, advancing in 4KB increments.  Each blue dot is
>>>>>> the read time of one experiment; the red dot is the median read time
>>>>>> for that read size.  The graphs show the absolute read time.
>>>>>>
>>>>>> There's very interesting behavior - it seems that there is a change
>>>>>> in behavior around reads of size 800KB.  The read times go down
>>>>>> significantly when you do *larger* reads.  I thought this was just
>>>>>> an artifact of the 64KB read-ahead I set in FUSE, so I upped the
>>>>>> read-ahead significantly, to 1MB.  In this case, the difference
>>>>>> between the small read sizes and the large read sizes is *very*
>>>>>> pronounced.  If it were an artifact from FUSE, I'd expect the place
>>>>>> where the change occurs to be a function of the read-ahead size.
>>>>>>
>>>>>> Does anyone out there who knows the code have any ideas?  What could
>>>>>> I be doing wrong?
>>>>>>
>>>>>> Brian
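
Tom's suggestion near the top of the thread -- check the return value of
read() and re-read, or use readFully() -- matters for any benchmark like the
script above, since a positioned read() may legally return fewer bytes than
requested.  Below is a minimal sketch of both approaches; the cluster URI,
file path, and class name are placeholders, not anything taken from the
thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.net.URI;

public class ReadFullyExample {
  public static void main(String[] args) throws Exception {
    // Placeholder cluster URI and file path -- substitute your own.
    FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000/"), new Configuration());
    FSDataInputStream in = fs.open(new Path("/path/to/some/file"));
    byte[] buf = new byte[128 * 1024];

    // Option 1: loop on the return value of the positioned read; a single
    // read() call may return fewer bytes than requested.
    int off = 0;
    while (off < buf.length) {
      int n = in.read((long) off, buf, off, buf.length - off);
      if (n < 0) break;            // hit end of file
      off += n;
    }

    // Option 2: let Hadoop do the looping for you.
    in.readFully(0L, buf, 0, buf.length);        // positioned readFully
    in.seek(0);
    IOUtils.readFully(in, buf, 0, buf.length);   // stream-oriented readFully

    in.close();
    fs.close();
  }
}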
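
Brian's observation about ordering -- running the largest reads first leaves
the Linux page cache warm, so the subsequent small reads are served from
client RAM -- can be reproduced against the FUSE mount with plain java.io.
This is a rough sketch rather than the code used for the graphs; the mount
point, file path, and class name are placeholders, and it assumes a file
comfortably larger than 1MB.

import java.io.RandomAccessFile;
import java.util.Random;

public class fuse_order_tester {
  public static void main(String[] args) throws Exception {
    // Placeholder path on a FUSE-mounted HDFS filesystem.
    RandomAccessFile f = new RandomAccessFile("/mnt/hadoop/path/to/large/file", "r");
    long fileLen = f.length();
    Random rand = new Random();
    int iters = 20;
    // Walk the read sizes from largest to smallest.  Data pulled in by the
    // big reads stays in the Linux page cache, so the later, small reads
    // come back from RAM and the low end of the curve flattens out.  Run
    // the loop smallest-first to compare against colder-cache numbers.
    for (int size = 1024 * 1024; size >= 4096; size -= 4 * 4096) {
      byte[] buf = new byte[size];
      long total = 0;
      for (int i = 0; i < iters; i++) {
        long pos = (long) (rand.nextDouble() * (fileLen - size));
        long st = System.nanoTime();
        f.seek(pos);
        f.readFully(buf, 0, size);
        long et = System.nanoTime();
        total += et - st;
      }
      System.out.println("Mean ms for reads of size " + size + ": "
          + (total / (double) iters / 1e6));
    }
    f.close();
  }
}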