hadoop-common-user mailing list archives

From Matei Zaharia <ma...@cloudera.com>
Subject Re: Hadoop/HDFS for scientific simulation output data analysis
Date Fri, 03 Apr 2009 15:21:10 GMT
Hi Tiankai,

One strange thing I see in your configuration as described is that the IO
buffer size and IO bytes per checksum are set to 64 MB. This is much higher
than the recommended defaults, which are about 64 KB for the buffer size and
1 KB or 512 bytes for the checksum. (Actually, I haven't seen anyone change
the checksum from its default of 512 bytes.) Having huge buffers is bad for
memory consumption and cache locality.
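
For reference, values in the usual range would look something like this in
hadoop-site.xml (the numbers below are just the ballpark figures mentioned
above, not tuned recommendations):

  <property>
    <name>io.file.buffer.size</name>
    <value>65536</value>   <!-- ~64 KB IO buffer -->
  </property>
  <property>
    <name>io.bytes.per.checksum</name>
    <value>512</value>     <!-- the usual default -->
  </property>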

The other thing that bothers me is that on your 64 MB data set, you have about
28.7 TB of HDFS bytes read. This is off from number of map tasks * bytes per
map by an order of magnitude (roughly 26,000 maps * 64 MB comes to about
1.7 TB, not 28.7 TB). Are you sure that you've generated the data set
correctly and that it's the only input path given to your job? Does
bin/hadoop dfs -dus <path to dataset> come out as 1.6 TB?

Matei

On Sat, Mar 28, 2009 at 4:10 PM, Tu, Tiankai
<Tiankai.Tu@deshawresearch.com>wrote:

> Hi,
>
> I have been exploring the feasibility of using Hadoop/HDFS to analyze
> terabyte-scale scientific simulation output datasets. After a set of
> initial experiments, I have a number of questions regarding (1) the
> configuration setting and (2) the IO read performance.
>
> ------------------------------------------------------------------------
> Unlike typical Hadoop applications, post-simulation analysis usually
> processes one file at a time. So I wrote a
> WholeFileInputFormat/WholeFileRecordReader that reads an entire file
> without parsing the content, as suggested by the Hadoop wiki FAQ.
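>
> A minimal sketch of such an input format, using the old
> org.apache.hadoop.mapred API of Hadoop 0.19 (class names here follow the
> description above and are illustrative, not necessarily the exact code):
>
>   import java.io.IOException;
>   import org.apache.hadoop.fs.FSDataInputStream;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.BytesWritable;
>   import org.apache.hadoop.io.IOUtils;
>   import org.apache.hadoop.io.NullWritable;
>   import org.apache.hadoop.mapred.*;
>
>   // Treats each input file as a single, unsplittable record.
>   public class WholeFileInputFormat
>       extends FileInputFormat<NullWritable, BytesWritable> {
>
>     protected boolean isSplitable(FileSystem fs, Path filename) {
>       return false;  // one map task per file, no splitting
>     }
>
>     public RecordReader<NullWritable, BytesWritable> getRecordReader(
>         InputSplit split, JobConf job, Reporter reporter) throws IOException {
>       return new WholeFileRecordReader((FileSplit) split, job);
>     }
>   }
>
>   // Hands the entire file content to map() as one BytesWritable value.
>   class WholeFileRecordReader
>       implements RecordReader<NullWritable, BytesWritable> {
>     private final FileSplit split;
>     private final JobConf conf;
>     private boolean processed = false;
>
>     WholeFileRecordReader(FileSplit split, JobConf conf) {
>       this.split = split;
>       this.conf = conf;
>     }
>
>     public boolean next(NullWritable key, BytesWritable value) throws IOException {
>       if (processed) return false;
>       byte[] contents = new byte[(int) split.getLength()];
>       FileSystem fs = split.getPath().getFileSystem(conf);
>       FSDataInputStream in = fs.open(split.getPath());
>       try {
>         IOUtils.readFully(in, contents, 0, contents.length);
>         value.set(contents, 0, contents.length);
>       } finally {
>         IOUtils.closeStream(in);
>       }
>       processed = true;
>       return true;
>     }
>
>     public NullWritable createKey()    { return NullWritable.get(); }
>     public BytesWritable createValue() { return new BytesWritable(); }
>     public long getPos()               { return processed ? split.getLength() : 0; }
>     public float getProgress()         { return processed ? 1.0f : 0.0f; }
>     public void close()                { }
>   }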
>
> Specifying WholeFileInputFormat as the input file format
> (conf.setInputFormat(FrameInputFormat.class)), I constructed a simple
> MapReduce program with the sole purpose of measuring how fast Hadoop/HDFS
> can read data. Here is the gist of the test program (a minimal sketch
> follows the list below):
>
> - The map method does nothing; it returns immediately when called
> - No reduce task (conf.setNumReduceTasks(0))
> - JVM reused (conf.setNumTasksToExecutePerJvm(-1))
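>
> A rough sketch of the driver and the empty map method (illustrative names;
> the actual job uses the whole-file input format described above, via the
> old mapred API):
>
>   import java.io.IOException;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.BytesWritable;
>   import org.apache.hadoop.io.NullWritable;
>   import org.apache.hadoop.mapred.*;
>   import org.apache.hadoop.mapred.lib.NullOutputFormat;
>
>   public class ReadBenchmark {
>
>     // The map method does nothing; all the read cost is in the record reader.
>     public static class NoOpMapper extends MapReduceBase
>         implements Mapper<NullWritable, BytesWritable, NullWritable, NullWritable> {
>       public void map(NullWritable key, BytesWritable value,
>           OutputCollector<NullWritable, NullWritable> output, Reporter reporter) {
>         // intentionally empty
>       }
>     }
>
>     public static void main(String[] args) throws IOException {
>       JobConf conf = new JobConf(ReadBenchmark.class);
>       conf.setJobName("hdfs-read-benchmark");
>       conf.setInputFormat(WholeFileInputFormat.class);  // whole-file records
>       conf.setOutputFormat(NullOutputFormat.class);     // nothing is written out
>       conf.setMapperClass(NoOpMapper.class);
>       conf.setNumReduceTasks(0);                        // map-only job
>       conf.setNumTasksToExecutePerJvm(-1);              // reuse task JVMs
>       FileInputFormat.setInputPaths(conf, new Path(args[0]));
>       JobClient.runJob(conf);
>     }
>   }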
>
> The detailed hardware/software configurations are listed below:
>
> Hardware:
> - 103 nodes, each with two 2.33GHz quad-core processors and 16 GB memory
> - One 1-GigE connection out of each node, connecting to a 1-GigE switch in
> its rack (3 racks in total)
> - Each rack switch has 4 10-GigE connections to a backbone
> full-bandwidth 10-GigE switch (second-tier switch)
> - Software (md) RAID0 on 4 SATA disks, with a capacity of 500 GB per
> node
> - Raw RAID0 bulk data transfer rate around 200 MB/s (measured by dd-ing a
> 4GB file after dropping the Linux VFS cache)
>
> Software:
> - 2.6.26-10smp kernel
> - Hadoop 0.19.1
> - Three nodes as namenode, secondary name node, and job tracker,
> respectively
> - Remaining 100 nodes as slaves, each running as both datanode and
> tasktracker
>
> Relevant hadoop-site.xml setting:
> - dfs.namenode.handler.count = 100
> - io.file.buffer.size = 67108864
> - io.bytes.per.checksum = 67108864
> - mapred.task.timeout = 1200000
> - mapred.map.max.attempts = 8
> - mapred.tasktracker.map.tasks.maximum = 8
> - dfs.replication = 3
> - topology.script.file.name set properly to a correct Python script
>
> Dataset characteristics:
>
> - Four datasets consisting of files of 1 GB, 256 MB, 64 MB, and 2 MB,
> respectively
> - Each dataset holds 1.6 terabytes of data (that is, 1600 1GB files, 6400
> 256MB files, etc.)
> - Datasets populated into HDFS via a parallel C MPI program (linked with
> libhdfs.so) running on the 100 slave nodes
> - dfs block size set to be the file size (otherwise, accessing a single
> file may require network data transfer); a sketch of the equivalent call
> follows this list
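>
> (For illustration: the dataset was populated from C through libhdfs, but
> the Java-side equivalent of creating a file with a per-file block size
> looks roughly like the following; the path, buffer size, and replication
> below are placeholders.)
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FSDataOutputStream;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   public class WriteWithFileSizedBlock {
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();
>       FileSystem fs = FileSystem.get(conf);
>       long fileSize = 64L * 1024 * 1024;      // e.g. one 64 MB file
>       // Block size set equal to the file size, so each file is one block.
>       FSDataOutputStream out = fs.create(
>           new Path(args[0]),   // destination path in HDFS
>           true,                // overwrite if present
>           65536,               // io buffer size
>           (short) 3,           // replication
>           fileSize);           // dfs block size for this file
>       // ... write the file contents here ...
>       out.close();
>     }
>   }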
>
> I launched the test MapReduce jobs one after another (so there was no
> interference between them) and collected the following performance results:
>
> Dataset        Finished in        Failed/Killed    HDFS bytes read       Rack-local   Launched    Data-local
>                                   task attempts    (Map = Total)         map tasks    map tasks   map tasks
> 1GB files      16mins11sec        0/382            2,578,054,119,424     98           1982        1873
> 256MB files    50min9sec          0/397            7,754,295,017,472     156          6797        6639
> 64MB files     4hrs18mins21sec    394/251          28,712,795,897,856    153          26245       26068
>
> The job for the 2MB file dataset failed to run due to the following
> error:
>
> 09/03/27 21:39:58 INFO mapred.FileInputFormat: Total input paths to
> process : 819200
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>        at java.util.Arrays.copyOf(Arrays.java:2786)
>        at
> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
>        at java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
>        at org.apache.hadoop.io.UTF8.writeChars(UTF8.java:274)
>
> After running into this error, the job tracker no longer accepted jobs.
> I stopped and restarted the job tracker with a larger heap size (8GB),
> but it still didn't accept new jobs.
>
> ------------------------------------------------------------------------
> Questions:
>
> (1) Why is reading 1GB files significantly faster than reading smaller
> files, even though reading a 256MB file is just as much a bulk transfer?
>
> (2) Why are the reported HDFS bytes read significantly higher than the
> dataset size (1.6TB)? (The fraction of failed/killed task attempts is far
> too small to account for the extra bytes read.)
>
> (3) What is the maximum number (roughly) of input paths the job tracker
> can handle? (For scientific simulation output datasets, it is quite
> commonplace to have hundreds of thousands to millions of files.)
>
> (4) Even for the 1GB file dataset, considering the percentage of
> data-local map tasks (94.5%), the overall end-to-end read bandwidth
> (1.69GB/s) is much lower than the potential performance offered by the
> hardware (200MB/s local RAID0 read performance multiplied by 100 slave
> nodes, i.e. roughly 20GB/s aggregate). What are the settings I should
> change (either in the test MapReduce program or in the config files) to
> obtain better performance?
>
> Thank you very much.
>
> Tiankai
>
