giraph-user mailing list archives

From Claudio Martella <claudio.marte...@gmail.com>
Subject Re: Basic questions about Giraph internals
Date Fri, 07 Feb 2014 15:19:21 GMT
 giraph.numComputeThreads (integer, default 1): number of threads for vertex computation
giraph.numInputThreads (integer, default 1): number of threads for input split loading
giraph.numOutputThreads (integer, default 1): number of threads for writing output at
the end of the application
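On the command line these can be set per job with `-ca`; for example, following the Giraph quick-start layout (the jar name and HDFS paths below are illustrative placeholders):

```shell
# Illustrative GiraphRunner invocation; jar name and HDFS paths are placeholders.
# One worker per node (-w 8 on an 8-node cluster), multiple threads per worker.
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/alex/input/graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/alex/output/sssp \
  -w 8 \
  -ca giraph.numComputeThreads=8 \
  -ca giraph.numInputThreads=4 \
  -ca giraph.numOutputThreads=4
```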


On Fri, Feb 7, 2014 at 4:17 PM, Sertuğ Kaya <sertug.kaya@agmlab.com> wrote:

>  Hi all;
> Thanks for these resourceful Q&As. I will also definitely try this
> one-mapper, multiple-threads setting per node.
> But Claudio, with which configuration options do you set multiple threads?
> Thanks
> Sertug
>
>
> On 06-02-2014 16:04, Alexander Frolov wrote:
>
>
> Claudio,
> thank you very much for your help.
>
> On Thu, Feb 6, 2014 at 4:06 PM, Claudio Martella <
> claudio.martella@gmail.com> wrote:
>
>>
>>
>>
>>  On Thu, Feb 6, 2014 at 12:15 PM, Alexander Frolov <
>> alexndr.frolov@gmail.com> wrote:
>>
>>>
>>>
>>>
>>>  On Thu, Feb 6, 2014 at 3:00 PM, Claudio Martella <
>>> claudio.martella@gmail.com> wrote:
>>>
>>>>
>>>>
>>>>
>>>>  On Thu, Feb 6, 2014 at 11:56 AM, Alexander Frolov <
>>>> alexndr.frolov@gmail.com> wrote:
>>>>
>>>>> Hi Claudio,
>>>>>
>>>>>  thank you.
>>>>>
>>>>>  If I understood correctly, a mapper and a mapper task are the same thing.
>>>>>
>>>>
>>>>  More or less. A mapper is a functional element of the programming
>>>> model, while the mapper task is the task that executes the mapper function
>>>> on the records.
>>>>
>>>
>>>  Ok, I see. Then mapred.tasktracker.map.tasks.maximum is the maximum
>>> number of Workers [or Workers + Master] that will be created on the same
>>> node.
>>>
>>>  That is, if I have an 8-node cluster
>>> with mapred.tasktracker.map.tasks.maximum=4, then I can run up to 31
>>> Workers + 1 Master.
>>>
>>>  Is it correct?
>>>
>>
>>  That is correct. However, if you have total control over your cluster,
>> you may want to run one worker per node (hence setting the maximum number
>> of map tasks per machine to 1) and use multiple threads (input, compute,
>> output).
>> This is going to make better use of the resources.
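For the record, on Hadoop 0.20 that per-node slot limit is set in mapred-site.xml on each tasktracker; a minimal fragment for the one-worker-per-node setup described above:

```xml
<!-- mapred-site.xml on each node: one map slot, so one Giraph worker per node -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
```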
>>
>
>  Should I explicitly force Giraph to use multiple threads for input,
> compute, and output? Only three threads, I suppose? But I have 12 cores
> available on each node (24 if HT is enabled).
>
>
>>
>>
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>>  On Thu, Feb 6, 2014 at 2:28 PM, Claudio Martella <
>>>>> claudio.martella@gmail.com> wrote:
>>>>>
>>>>>> Hi Alex,
>>>>>>
>>>>>>  answers are inline.
>>>>>>
>>>>>>
>>>>>>  On Thu, Feb 6, 2014 at 11:22 AM, Alexander Frolov <
>>>>>> alexndr.frolov@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, folks!
>>>>>>>
>>>>>>>  I have started a small research project on the Giraph framework,
>>>>>>> and I do not have much experience with Giraph and Hadoop :-(.
>>>>>>>
>>>>>>>  I would like to ask several questions about how things work
>>>>>>> in Giraph that are not straightforward to me. I am trying to use the
>>>>>>> sources, but sometimes it is not too easy ;-)
>>>>>>>
>>>>>>>  So here they are:
>>>>>>>
>>>>>>>  1) How are Workers assigned to TaskTrackers?
>>>>>>>
>>>>>>
>>>>>>  Each worker is a mapper, and mapper tasks are assigned to
>>>>>> tasktrackers by the jobtracker.
>>>>>>
>>>>>
>>>>>  That is, each Worker is created at the beginning of a superstep and
>>>>> then dies, and in the next superstep all Workers are created again. Is
>>>>> that correct?
>>>>>
>>>>
>>>>  Nope. The workers are created at the beginning of the computation
>>>> and destroyed at the end of the computation. A worker is persistent
>>>> throughout the computation.
>>>>
>>>>
>>>>>
>>>>>
>>>>>>   There's no control by Giraph there, and because Giraph doesn't
>>>>>> need data-locality like Mapreduce does, basically nothing is done.
>>>>>>
>>>>>
>>>>>  This is important for me. So a Giraph Worker (a.k.a. a Hadoop mapper)
>>>>> fetches the vertex with the corresponding index from HDFS and performs the
>>>>> computation. What does it do with it next? As I understood, Giraph is a
>>>>> fully in-memory framework, and in the next superstep this vertex should be
>>>>> fetched from memory by the same Worker. Where are the vertices stored
>>>>> between supersteps? In HDFS or in memory?
>>>>>
>>>>
>>>>  As I said, the workers are persistent (in-memory) between supersteps,
>>>> so they keep everything in memory.
>>>>
>>>
>>>   Ok.
>>>
>>>  Is there any means of seeing the assignment of Workers to TaskTrackers
>>> during or after the computation?
>>>
>>
>>  The jobtracker HTTP interface will show you the mappers running, hence
>> I'd check there.
>>
>>
>>>
>>>  And is there any means of seeing the assignment of vertices to Workers
>>> (as a distribution function, histogram, etc.)?
>>>
>>
>>  You can check the worker logs, I think the information should be there.
>>
>>
>>>
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>  2) How are vertices assigned to Workers? Does it depend on the
>>>>>>> distribution of the input file on DataNodes? Is there any choice of
>>>>>>> distribution policies available or not?
>>>>>>>
>>>>>>
>>>>>>  In the default scheme, vertices are assigned through modulo hash
>>>>>> partitioning. Given k workers, vertex v is assigned to worker i according
>>>>>> to hash(v) % k = i.
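A minimal sketch of the modulo scheme described above (the class and method names here are hypothetical, for illustration; Giraph's actual default lives in its hash-partitioner classes):

```java
// Sketch of modulo hash partitioning: vertex v goes to worker hash(v) % k.
// Class and method names are hypothetical, for illustration only.
public class ModuloHashPartition {

    // Returns the index i of the worker owning vertexId, given k workers.
    static int workerFor(long vertexId, int numWorkers) {
        // Math.abs guards against negative hash codes
        // (ignoring the Integer.MIN_VALUE edge case in this sketch).
        return Math.abs(Long.hashCode(vertexId)) % numWorkers;
    }

    public static void main(String[] args) {
        int k = 4; // e.g. a job run with 4 workers
        for (long v = 0; v < 8; v++) {
            System.out.println("vertex " + v + " -> worker " + workerFor(v, k));
        }
    }
}
```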
>>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>  3) How are Workers and Map tasks related to each other? (1:1)?
>>>>>>> (n:1)? (1:n)?
>>>>>>>
>>>>>>
>>>>>>  It's 1:1. Each worker is implemented by a mapper task. The master
>>>>>> is usually (but need not be) implemented by an additional mapper.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>  4) Can Workers migrate from one TaskTracker to another?
>>>>>>>
>>>>>>
>>>>>>  Workers do not migrate. A Giraph computation is not dynamic with
>>>>>> respect to the assignment and size of the tasks.
>>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>  5) What is the best way to monitor Giraph app execution (progress,
>>>>>>> worker assignment, load balancing etc.)?
>>>>>>>
>>>>>>
>>>>>>  Just like you would for a standard MapReduce job: go to the job
>>>>>> page on the jobtracker HTTP interface.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>  I think this is all for the moment. Thank you.
>>>>>>>
>>>>>>>  Testbed description:
>>>>>>>  Hardware: 8 node dual-CPU cluster with IB FDR.
>>>>>>>  Giraph: release-1.0.0-RC2-152-g585511f
>>>>>>> Hadoop: hadoop-0.20.203.0, hadoop-rdma-0.9.8
>>>>>>>
>>>>>>>  Best,
>>>>>>>    Alex
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>  --
>>>>>>     Claudio Martella
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>  --
>>>>     Claudio Martella
>>>>
>>>>
>>>
>>>
>>
>>
>>  --
>>     Claudio Martella
>>
>>
>
>
>


-- 
   Claudio Martella
