hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: HDFS random read performance vs sequential read performance ?
Date Mon, 20 Jul 2009 17:23:19 GMT
From your description, I'm not entirely clear on what tests you're
comparing. It seems almost like you're comparing your data on your hardware
against some data someone else posted on an entirely different cluster. Can
you upload a complete working Java program that reproduces all of your
results (sequential, nonsequential, etc) somewhere?

-Todd

On Mon, Jul 20, 2009 at 12:41 AM, Martin Mituzas <xietao1981@hotmail.com> wrote:

>
> hi, all
> I see there are two read methods in DFSInputStream:
>
> int read(byte buf[], int off, int len)
> int read(long position, byte[] buffer, int offset, int length)
>
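(A quick sketch of the semantic difference between these two calls; assume
`in` is an FSDataInputStream obtained from FileSystem.open() and `buf` is a
byte[] (the names are placeholders, not from the benchmark below):)

    // Stream read: consumes from the stream's current position and advances
    // it, so consecutive calls walk the file sequentially.
    int n1 = in.read(buf, 0, buf.length);

    // Positional read (pread): reads at an explicit offset and leaves the
    // stream position untouched; each call is independent of the last.
    int n2 = in.read(0L, buf, 0, buf.length);
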
> And I use the following code to test the read performance.
> Before the test I generate some files in the directory DATA_DIR; then I run
> this function for some time and calculate the read throughput.
> The initFiles() function is borrowed from the patch
> https://issues.apache.org/jira/browse/HDFS-236.
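(The initFiles() helper isn't reproduced in this message. A hypothetical
reconstruction, inferred from how fileList, FileInfo, filePath, and
fileStream are used in the code below; the actual HDFS-236 code may differ,
and the usual java.util and org.apache.hadoop.fs imports are assumed:)

    // Hypothetical sketch, not the actual HDFS-236 helper.
    private static class FileInfo {
        Path filePath;
        FSDataInputStream fileStream;  // opened lazily by the read loop
        FileInfo(Path p) { filePath = p; }
    }

    private List<FileInfo> fileList = new ArrayList<FileInfo>();

    private void initFiles(String dataDir) throws IOException {
        // collect every plain file under the test directory into fileList
        for (FileStatus st : fs.listStatus(new Path(dataDir))) {
            if (!st.isDir()) {
                fileList.add(new FileInfo(st.getPath()));
            }
        }
    }
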
> My question is: I tried the above two read methods and found a huge
> difference in throughput. The results are attached below. Is there something
> wrong with my code? I can't believe there can be such a big difference...
> And in https://issues.apache.org/jira/browse/HDFS-236, I saw the following
> performance data posted by Raghu Angadi:
>
> Description of read                        Time for each read (ms)
> 1000 native reads over block files          9.5
> Random Read 10x500                         10.8
> Random Read without CRC                    10.5
> Random Read with 'seek() and read()'       12.5
> Read with sequential offsets                1.7
> 1000 native reads without closing files     7.5
>
> So based on this data, sequential read is about 6x faster than random read
> (10.8 ms vs 1.7 ms per read), which is reasonable, while my data seems
> unreasonable. Can anybody provide some comments?
>
> Here are my test results.
>
> with the first read method, read(byte buf[], int off, int len):
>
> test type,read size,read ops,start time,end time,test time,real read time,throughput
> sequence read,64628740096,15778506,[2009-07-20 14:47:01 704],[2009-07-20 14:53:41 704],400,400,154.09
>
> with the second read method, read(long position, byte[] buffer, int offset, int length):
>
> test type,read size,read ops,start time,end time,test time,real read time,throughput
> sequence read,2400047104,585949,[2009-07-20 14:59:50 328],[2009-07-20 15:06:30 328],400,400,5.72
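(A quick consistency check on those two rows, not from the original message:
read size / read ops gives the per-call buffer size, 64628740096 / 15778506
≈ 4096 and 2400047104 / 585949 = 4096 exactly, so both runs issue 4 KB
reads. Throughput is read size / test time converted to MB/s:
64628740096 / 400 / 1024^2 ≈ 154.09 and 2400047104 / 400 / 1024^2 ≈ 5.72,
so the stream read comes out roughly 27x faster than the positional read in
this setup.)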
>
> My cluster: 1 name node + 3 data nodes, replication = 3.
> And my code:
>
> private void sequenceRead(long time) throws IOException {
>
>        byte[] data = new byte[bufferSize];
>        Random rand = new Random();
>        initFiles(DATA_DIR);
>        long period = time * 1000;
>        FSDataInputStream in = null;
>        long totalSize = 0;
>        long readCount = 0;
>        long offset = 0;
>        // start at a random file in the list
>        int index = (rand.nextInt() & Integer.MAX_VALUE) % fileList.size();
>        if (barrier()) {
>                start = System.currentTimeMillis();
>                while (System.currentTimeMillis() - start < period) {
>                        if (in == null) {
>                                FileInfo file = (FileInfo) fileList.get(index);
>                                in = file.fileStream;
>                                if (in == null) {
>                                        in = fs.open(file.filePath);
>                                        file.fileStream = in;
>                                }
>                                // advance round-robin to the next file; note that
>                                // "index = (index++) % fileList.size()" would never
>                                // advance, since the post-increment is overwritten
>                                index = (index + 1) % fileList.size();
>                        }
>                        long actualSize = in.read(offset, data, 0, bufferSize);
>                        //long actualSize = in.read(data, 0, bufferSize);
>                        readCount++;
>
>                        if (actualSize > 0) {
>                                totalSize += actualSize;
>                                offset += actualSize;
>                        }
>                        if (actualSize < bufferSize) {
>                                // EOF or short read: move on to the next file
>                                //in.seek(0);
>                                in = null;
>                                offset = 0;
>                        }
>                }
>                end = System.currentTimeMillis();
>
>                // streams were cached per file; close them all now
>                for (FileInfo finfo : fileList) {
>                        if (finfo.fileStream != null)
>                                IOUtils.closeStream(finfo.fileStream);
>                }
>                System.out.println("test type,read size,read ops,start time,end time,test time,real read time,throughput");
>                String s = String.format("sequence read,%d,%d,[%s],[%s],%d,%d,%.2f",
>                                totalSize,
>                                readCount,
>                                sdf.format(new Date(start)),
>                                sdf.format(new Date(end)),
>                                time,
>                                (end - start) / 1000,
>                                (double) (totalSize * 1000) / (double) ((end - start) * 1024 * 1024));
>                System.out.println(s);
>        }
> }
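(Incidentally, the self-contained reproduction Todd asks for at the top could
look roughly like this. A sketch under assumptions: the NameNode URI, test
file path, 4 KB buffer, and 400 s duration are placeholders; run it once
with and once without the "pread" argument:)

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadBench {
        public static void main(String[] args) throws Exception {
            boolean pread = args.length > 0 && args[0].equals("pread");
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
            FSDataInputStream in = fs.open(new Path("/bench/file0"));
            byte[] buf = new byte[4096];
            long total = 0;
            long offset = 0;
            long start = System.currentTimeMillis();
            while (System.currentTimeMillis() - start < 400 * 1000L) {
                // same loop body for both modes; only the read call differs
                int n = pread
                        ? in.read(offset, buf, 0, buf.length)  // positional read
                        : in.read(buf, 0, buf.length);         // stream read
                if (n > 0) {
                    total += n;
                    offset += n;
                }
                if (n < buf.length) {
                    // EOF or short read: rewind to the start of the file
                    in.seek(0);
                    offset = 0;
                }
            }
            long ms = System.currentTimeMillis() - start;
            in.close();
            System.out.printf("%s: %d bytes in %d ms = %.2f MB/s%n",
                    pread ? "pread" : "stream", total, ms,
                    total * 1000.0 / (ms * 1024.0 * 1024.0));
        }
    }
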
>
>
>
> --
> View this message in context:
> http://www.nabble.com/HDFS-random-read-performance-vs-sequential-read-performance---tp24565264p24565264.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
