giraph-user mailing list archives

From Claudio Martella <claudio.marte...@gmail.com>
Subject Re: Basic questions about Giraph internals
Date Thu, 06 Feb 2014 14:12:12 GMT
On Thu, Feb 6, 2014 at 3:04 PM, Alexander Frolov
<alexndr.frolov@gmail.com>wrote:

>
> Claudio,
> thank you very much for your help.
>
> On Thu, Feb 6, 2014 at 4:06 PM, Claudio Martella <
> claudio.martella@gmail.com> wrote:
>
>>
>>
>>
>> On Thu, Feb 6, 2014 at 12:15 PM, Alexander Frolov <
>> alexndr.frolov@gmail.com> wrote:
>>
>>>
>>>
>>>
>>> On Thu, Feb 6, 2014 at 3:00 PM, Claudio Martella <
>>> claudio.martella@gmail.com> wrote:
>>>
>>>>
>>>>
>>>>
>>>> On Thu, Feb 6, 2014 at 11:56 AM, Alexander Frolov <
>>>> alexndr.frolov@gmail.com> wrote:
>>>>
>>>>> Hi Claudio,
>>>>>
>>>>> thank you.
>>>>>
>>>>> If I understood correctly, a mapper and a mapper task are the same thing.
>>>>>
>>>>
>>>> More or less. A mapper is a functional element of the programming
>>>> model, while the mapper task is the task that executes the mapper function
>>>> on the records.
>>>>
>>>
>>> Ok, I see. Then mapred.tasktracker.map.tasks.maximum is the maximum
>>> number of Workers [or Workers + Master] that will be created on the same
>>> node.
>>>
>>> That is, if I have an 8-node cluster
>>> with mapred.tasktracker.map.tasks.maximum=4, then I can run up to 31
>>> Workers + 1 Master.
>>>
>>> Is that correct?
>>>
>>
>> That is correct. However, if you have total control over your cluster,
>> you may want to run one worker per node (hence setting the maximum number
>> of map tasks per machine to 1) and use multiple threads (input, compute,
>> output). This is going to make better use of the resources.
>>
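Capping map slots at one per node, as suggested above, would be done in mapred-site.xml on each tasktracker. A minimal fragment for this Hadoop 0.20-era property (to be merged into the existing configuration):

```xml
<!-- mapred-site.xml: at most one map task (i.e. one Giraph worker) per node -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
```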
>
> Should I explicitly force Giraph to use multiple threads for input,
> compute, output? Only three threads, I suppose? But I have 12 cores
> available in each node (24 if HT is enabled).
>

You're right, I was not clear. I suggest you use N threads for each of
those three classes, where N is something close to the number of processing
units (e.g. cores) you have available on each machine.
Consider that Giraph has a number of other threads running in the
background, for example to handle communication etc. I suggest you try
different setups through benchmarking.
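For example, on the 8-node, 12-core cluster described below, the advice above (one worker per node, N threads per phase) might translate into a job submission along these lines. This is a sketch only: the computation and format class names are illustrative placeholders, and the giraph.num*Threads custom arguments are the thread-pool options to tune via benchmarking.

```shell
# Sketch: 8 nodes -> 7 workers + 1 master (-w 7), 12 threads per phase.
# Class names and paths are illustrative; tune thread counts by benchmarking.
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /input/graph \
  -op /output/sssp \
  -w 7 \
  -ca giraph.numInputThreads=12 \
  -ca giraph.numComputeThreads=12 \
  -ca giraph.numOutputThreads=12
```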



>
>
>>
>>
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 6, 2014 at 2:28 PM, Claudio Martella <
>>>>> claudio.martella@gmail.com> wrote:
>>>>>
>>>>>> Hi Alex,
>>>>>>
>>>>>> answers are inline.
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 6, 2014 at 11:22 AM, Alexander Frolov <
>>>>>> alexndr.frolov@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, folks!
>>>>>>>
>>>>>>> I have started a small research project on the Giraph framework, and
>>>>>>> I do not have much experience with Giraph and Hadoop :-(.
>>>>>>>
>>>>>>> I would like to ask several questions about how things work in
>>>>>>> Giraph, which are not straightforward for me. I am trying to use the
>>>>>>> sources, but sometimes it is not too easy ;-)
>>>>>>>
>>>>>>> So here they are:
>>>>>>>
>>>>>>> 1) How are Workers assigned to TaskTrackers?
>>>>>>>
>>>>>>
>>>>>> Each worker is a mapper, and mapper tasks are assigned to
>>>>>> tasktrackers by the jobtracker.
>>>>>>
>>>>>
>>>>> That is, each Worker is created at the beginning of a superstep and
>>>>> then dies, and in the next superstep all Workers are created again. Is
>>>>> that correct?
>>>>>
>>>>
>>>> Nope. The workers are created at the beginning of the computation, and
>>>> destroyed at the end of the computation. A worker is persistent
>>>> throughout the computation.
>>>>
>>>>
>>>>>
>>>>>
>>>>>> Giraph has no control there, and because Giraph doesn't need
>>>>>> data locality the way Mapreduce does, basically nothing is done.
>>>>>>
>>>>>
>>>>> This is important for me. So a Giraph Worker (a.k.a. Hadoop mapper)
>>>>> fetches the vertex with the corresponding index from HDFS and performs
>>>>> the computation. What does it do with it next? As I understood, Giraph
>>>>> is a fully in-memory framework, and in the next superstep this vertex
>>>>> should be fetched from memory by the same Worker. Where are the
>>>>> vertices stored between supersteps? In HDFS or in memory?
>>>>>
>>>>
>>>> As I said, the workers are persistent (in-memory) between supersteps,
>>>> so they keep everything in memory.
>>>>
>>>
>>> Ok.
>>>
>>> Is there any way to see the assignment of Workers to TaskTrackers during
>>> or after the computation?
>>>
>>
>> The jobtracker HTTP interface will show you the mappers running, hence
>> I'd check there.
>>
>>
>>>
>>> And is there any way to see the assignment of vertices to Workers (as a
>>> distribution function, histogram, etc.)?
>>>
>>
>> You can check the worker logs, I think the information should be there.
>>
>>
>>>
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> 2) How are vertices assigned to Workers? Does it depend on the
>>>>>>> distribution of the input file on the DataNodes? Is there any choice
>>>>>>> of partitioning policy?
>>>>>>>
>>>>>>
>>>>>> In the default scheme, vertices are assigned through modulo hash
>>>>>> partitioning. Given k workers, vertex v is assigned to worker i according
>>>>>> to hash(v) % k = i.
>>>>>>
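The default scheme described above can be sketched in plain Java. This is an illustrative toy, not Giraph's actual partitioner code; the real logic lives in Giraph's hash-based partitioner and operates on the vertex id's hashCode:

```java
// Illustrative sketch of modulo hash partitioning: worker i = hash(v) % k.
public class ModuloPartitionDemo {

    // Returns the worker index for a vertex id, given numWorkers workers.
    static int workerFor(long vertexId, int numWorkers) {
        // Math.abs guards against negative hash codes (simplified for the demo).
        return Math.abs(Long.hashCode(vertexId)) % numWorkers;
    }

    public static void main(String[] args) {
        int k = 4; // e.g. 4 workers
        for (long v = 0; v < 8; v++) {
            System.out.println("vertex " + v + " -> worker " + workerFor(v, k));
        }
    }
}
```

With small non-negative ids the assignment simply cycles through the workers, which is why the default scheme balances vertex counts but ignores graph locality.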
>>>>>
>>>>>>
>>>>>>>
>>>>>>> 3) How are Workers and Map tasks related to each other? (1:1)?
>>>>>>> (n:1)? (1:n)?
>>>>>>>
>>>>>>
>>>>>> It's 1:1. Each worker is implemented by a mapper task. The master is
>>>>>> usually (but does not need to be) implemented by an additional mapper.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> 4) Can Workers migrate from one TaskTracker to the other?
>>>>>>>
>>>>>>
>>>>>> Workers do not migrate. A Giraph computation is not dynamic with
>>>>>> respect to the assignment and size of the tasks.
>>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> 5) What is the best way to monitor Giraph app execution (progress,
>>>>>>> worker assignment, load balancing etc.)?
>>>>>>>
>>>>>>
>>>>>> Just like you would a standard Mapreduce job: go to the job page on
>>>>>> the jobtracker HTTP interface.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I think this is all for the moment. Thank you.
>>>>>>>
>>>>>>> Testbed description:
>>>>>>> Hardware: 8 node dual-CPU cluster with IB FDR.
>>>>>>> Giraph: release-1.0.0-RC2-152-g585511f
>>>>>>> Hadoop: hadoop-0.20.203.0, hadoop-rdma-0.9.8
>>>>>>>
>>>>>>> Best,
>>>>>>>    Alex
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>
>


-- 
   Claudio Martella
