hadoop-common-user mailing list archives

From tim robertson <timrobertson...@gmail.com>
Subject Re: Hardware performance from HADOOP cluster
Date Fri, 16 Oct 2009 11:01:11 GMT
Hi all,

I added the following to core-site.xml, mapred-site.xml and
hdfs-site.xml, based on the Cloudera guidelines:
  io.sort.factor: 15  (mapred-site.xml)
  io.sort.mb: 150  (mapred-site.xml)
  io.file.buffer.size: 65536   (core-site.xml)
  dfs.datanode.handler.count: 3  (hdfs-site.xml; actually this is the default)

I am also using the default HADOOP_HEAPSIZE=1000 (hadoop-env.sh).
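
For concreteness, here is roughly how the two sort settings above look
in mapred-site.xml (the core-site.xml and hdfs-site.xml entries follow
the same <property> pattern):

  <!-- mapred-site.xml: values from the list above -->
  <property>
    <name>io.sort.factor</name>
    <value>15</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>150</value>
  </property>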

With 2 mappers and 2 reducers per node, can someone please help me with
the maths as to why my map tasks are failing with "Error: Java heap
space"?
(The same job runs fine with io.sort.factor of 10 and io.sort.mb of 100.)

io.sort.mb of 150 x 4 tasks (2 mappers, 2 reducers) = 0.6G
Plus the 2 daemons on the node at 1G heap each = 2.6G running total
Plus an Xmx of 1G for each of the 4 task JVMs = 6.6G in all

The machines have 8G in them.  Obviously my maths is screwy somewhere...
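
(For reference, the 1G Xmx per task in the sum above would have to come
from mapred.child.java.opts in mapred-site.xml -- something like the
sketch below; if I remember right, the stock default is only -Xmx200m:

  <property>
    <name>mapred.child.java.opts</name>
    <!-- assumed value; the 0.20 default is -Xmx200m -->
    <value>-Xmx1024m</value>
  </property>
)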


On Fri, Oct 16, 2009 at 9:59 AM, Erik Forsberg <forsberg@opera.com> wrote:
> On Thu, 15 Oct 2009 11:32:35 +0200
> Usman Waheed <usmanw@opera.com> wrote:
>> Hi Todd,
>> Some changes have been applied to the cluster based on the
>> documentation (URL) you noted below,
> I would also like to know what settings people are tuning at the
> operating system level. The blog post mentioned here does not say much
> about that, except for the fileno changes.
> We got about 3x the read performance in DFSIOTest after mounting our
> ext3 filesystems with the noatime option. I saw that mentioned in the
> slides from some Cloudera presentation.
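> (Invocation, if memory serves, is along the lines of:
>
>     hadoop jar hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
>
> with a -write run first to generate the test data.)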
> (For those who don't know, the noatime option turns off the recording
> of access times on files. The default behaviour is a horrible
> performance killer, since it means that every read of a file also
> forces the kernel to do a write. These writes are probably queued up,
> but still: if you don't need atime (very few applications do), turn it
> off!)
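> For example, something like this in /etc/fstab (device and mount point
> made up, adjust to your layout):
>
>     /dev/sdb1  /data  ext3  defaults,noatime  0  2
>
> followed by "mount -o remount,noatime /data" to apply it without a
> reboot.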
> Have people been experimenting with different filesystems, or are most
> of us running on top of ext3?
> How about mounting ext3 with "data=writeback"? That's rumoured to give
> the best throughput and could help with write performance. From
> mount(8):
>     writeback
>            Data ordering is not preserved - data may be written into the main
>            file system after its metadata has been committed to the journal.
>            This is rumoured to be the highest throughput option. It guarantees
>            internal file system integrity, however it can allow old data to
>            appear in files after a crash and journal recovery.
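> In fstab terms that would be something like (same made-up device as
> above):
>
>     /dev/sdb1  /data  ext3  noatime,data=writeback  0  2
>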
> How would the HDFS consistency checks cope with old data appearing in
> the underlying files after a system crash?
> Cheers,
> \EF
> --
> Erik Forsberg <forsberg@opera.com>
> Developer, Opera Software - http://www.opera.com/
