From: David Boyd <dboyd@data-tactics.com>
To: user@giraph.apache.org
Date: Thu, 27 Jun 2013 13:11:17 -0400
Subject: Re: Optimal configuration for benchmark
Christian:
      I have actually been looking for a more general Giraph benchmark and would love to test/play with
what you have.

      To answer your questions we need to first assume a dedicated cluster where your test is the only
one running.

       For the number of mappers, we will assume that your cluster is configured with the pseudo-standard one mapper
per core (i.e., the maximum number of mappers on each node equals the number of cores on that node).  Because Giraph
is CPU-bound, it is important that you not oversubscribe the cores.

         So for the number of Giraph workers you should use <total cluster mappers> - 1.  This is because Giraph needs
one mapper for the master task.
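To make the arithmetic concrete, here is a minimal sketch; the node and core counts are made-up example figures, not from this thread:

```shell
# Hypothetical cluster: 10 worker nodes, 8 map slots (one per core) each.
NODES=10
CORES_PER_NODE=8
TOTAL_MAPPERS=$((NODES * CORES_PER_NODE))   # 80 map slots cluster-wide
# Reserve one mapper for the Giraph master, as described above.
GIRAPH_WORKERS=$((TOTAL_MAPPERS - 1))
echo "$GIRAPH_WORKERS"   # prints 79
```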

           HEAP_SIZE and mapred.map.child.java.opts are basically equivalent (I prefer the latter).  In any case,
part of the answer depends on what besides Hadoop is running on each node.  Generally, you want each
mapper to have as much heap space as possible.  The goal is to avoid swapping, leave enough memory free for the buffer
cache, and give each task enough heap that it does not need to spend a lot of time in garbage collection.
I like to look at an idle node and see what the base overhead of used memory is.  Then, depending on the I/O
requirements of my job (especially read I/O), I reserve a portion of the remaining memory for the buffer cache and
divide the remainder by the number of mappers per node.
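That top-down calculation can be sketched as follows; every memory figure here is an illustrative assumption for a hypothetical 48 GB node, not a measurement from this thread:

```shell
# Top-down heap sizing for one node (all figures are example values).
NODE_MEM_MB=49152        # 48 GB of RAM on the node
BASE_OVERHEAD_MB=4096    # OS + daemons, measured on an idle node
BUFFER_CACHE_MB=8192     # reserved for read-heavy jobs
MAPPERS_PER_NODE=8       # one per core
HEAP_PER_MAPPER_MB=$(( (NODE_MEM_MB - BASE_OVERHEAD_MB - BUFFER_CACHE_MB) / MAPPERS_PER_NODE ))
echo "$HEAP_PER_MAPPER_MB"   # prints 4608
```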

              That is sort of the top-down approach.  A bottom-up approach would look at the size of the objects being
managed/used in a mapper and compute upwards from there.

               That said, -Xmx4g would be the low end of what I would specify.   Also, you may want to set the options
which change how Java does garbage collection.
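For example, the heap and GC options could be passed on the command line roughly like this. The jar name, computation class, and flag choices below are hypothetical placeholders (-XX:+UseConcMarkSweepGC and -verbose:gc are standard HotSpot options, but whether they help depends on your workload):

```shell
# Sketch of a Giraph launch with per-task JVM options: 4 GB heap,
# concurrent GC, and GC logging. Jar and class names are placeholders.
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
  -Dmapred.map.child.java.opts="-Xmx4g -XX:+UseConcMarkSweepGC -verbose:gc" \
  my.benchmark.ComputationClass -w 79
```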

Hope this helps.

On 6/27/2013 12:20 PM, Christian Krause wrote:
Hi,

I implemented a benchmark that allows me to generate an arbitrarily large graph (depending on the number of iterations). Now I would like to configure Giraph so that I can make the best use of my hardware for this benchmark. Based on the number of nodes in my cluster, their amount of main memory and number of cores, I am asking myself how do I determine the optimal parameters of Giraph / Hadoop, specifically:

- the number of used mappers
- the HEAP_SIZE environment variable
- the memory specified in the mapred.map.child.java.opts property

(any other relevant parameters?)

Also, I was wondering how well Giraph can handle computations which start with a very small graph and mutate it to a very large one. For example, if I understand correctly the number of mappers is not dynamically adjusted.

Any hints (or links to documentation) are highly appreciated.

Cheers,
Christian



-- 
========= mailto:dboyd@data-tactics.com ============
David W. Boyd                     
Director, Engineering       
7901 Jones Branch, Suite 700   
Mclean, VA 22102         
office:   +1-571-279-2122    
fax:     +1-703-506-6703    
cell:     +1-703-402-7908
============== http://www.data-tactics.com.com/ ============
First Robotic Mentor - FRC, FTC - www.iliterobotics.org
President - USSTEM Foundation - www.usstem.org

The information contained in this message may be privileged 
and/or confidential and protected from disclosure.  
If the reader of this message is not the intended recipient 
or an employee or agent responsible for delivering this message 
to the intended recipient, you are hereby notified that any 
dissemination, distribution or copying of this communication 
is strictly prohibited.  If you have received this communication 
in error, please notify the sender immediately by replying to 
this message and deleting the material from any computer.
