giraph-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: How to specify parameters in order to run giraph job in parallel
Date Fri, 18 Oct 2013 17:31:29 GMT
Da,

Holding objects in serialized form in byte arrays consumes much
less memory than holding them as Java objects (which have a huge
overhead). I think that is the other main reason for serialization.
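
To make the memory argument concrete, here is a minimal sketch that packs a hypothetical vertex type (a long id plus a double value, not a Giraph class) into one byte array and compares that with a rough estimate of the same data held as individual objects. The header and reference sizes are assumptions for a 64-bit JVM; actual values depend on the JVM and its flags.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SerializedFootprint {
    /** Hypothetical vertex with a long id and a double value (not a Giraph class). */
    static final class Vertex {
        final long id;
        final double value;

        Vertex(long id, double value) {
            this.id = id;
            this.value = value;
        }
    }

    // Assumed per-object costs on a 64-bit JVM; real values depend on the JVM and its flags.
    static final int ASSUMED_HEADER_BYTES = 16;
    static final int ASSUMED_REFERENCE_BYTES = 8;

    /** Packs all vertices into one byte array: 8 bytes per long plus 8 bytes per double. */
    static byte[] serialize(Vertex[] vertices) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Vertex v : vertices) {
                out.writeLong(v.id);
                out.writeDouble(v.value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Rough estimate of the heap used by the same data held as individual objects. */
    static long estimateObjectBytes(int count) {
        long perVertex = ASSUMED_HEADER_BYTES + Long.BYTES + Double.BYTES
                + ASSUMED_REFERENCE_BYTES;
        return perVertex * count;
    }

    public static void main(String[] args) {
        int count = 1_000_000;
        Vertex[] vertices = new Vertex[count];
        for (int i = 0; i < count; i++) {
            vertices[i] = new Vertex(i, 1.0 / count);
        }
        byte[] packed = serialize(vertices);
        System.out.println("serialized bytes:       " + packed.length);
        System.out.println("estimated object bytes: " + estimateObjectBytes(count));
        System.out.println("objects for the GC to trace: 1 array vs. " + count + " vertices");
    }
}
```

The serialized form costs exactly 16 bytes per vertex here, and the garbage collector sees one array instead of one object per vertex, which is where the GC-time reduction comes from.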

--sebastian

On 18.10.2013 19:28, YAN Da wrote:
> Dear Claudio Martella,
> 
> According to https://reviews.apache.org/r/7990/diff/?page=2, Giraph
> currently organizes vertices as byte streams, probably in pages.
> 
> That page says, "This also significantly reduces GC time, as there are less
> objects to GC."
> 
> Why is there an "also"? I mean, is reducing GC time the only reason for
> doing serialization?
> 
> Regards,
> Da
> 
>> Dear Claudio Martella,
>>
>> I don't quite get what you mean. Our cluster has 15 servers, each with 24
>> cores, so ideally there can be 15*24 threads/partitions working in parallel,
>> right? (Perhaps deduct one for ZooKeeper.)
>>
>> However, when we set the "-Dgiraph.numComputeThreads" option, we find that
>> we cannot get even 20 threads, and when it is set to 10, the CPU usage is
>> only a little more than double that of the default setting, nowhere near
>> 100*numComputeThreads%.
>>
>> How can we configure it to utilize all the processors on our servers?
>>
>> Regards,
>> Da Yan
>>
>>> It actually depends on the setup of your cluster.
>>>
>>> Ideally, with 15 nodes (tasktrackers) you'd want 1 mapper slot per node
>>> (ideally to run Giraph), so that you would have 14 workers, one per
>>> computing node, plus one for master+ZooKeeper. Once that is reached, you
>>> would set the number of compute threads equal to the number of threads
>>> that you can run on each node (24 in your case).
>>>
>>> Does this make sense to you?
>>>
>>>
>>> On Thu, Oct 17, 2013 at 5:04 PM, Yi Lu <luyi0619@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a computer cluster consisting of 15 slave machines and 1 master
>>>> machine.
>>>>
>>>> On each slave machine, there are two Xeon E5-2620 CPUs. With the help of
>>>> HT, there are 24 threads.
>>>>
>>>> I am wondering how to specify parameters in order to run a Giraph job in
>>>> parallel on my cluster.
>>>>
>>>> I am using the following parameters to run a PageRank algorithm.
>>>>
>>>> hadoop jar ~/giraph-examples.jar org.apache.giraph.GiraphRunner
>>>> SimplePageRank -vif PageRankInputFormat -vip /input -vof
>>>> PageRankOutputFormat -op /pagerank -w 1 -mc
>>>> SimplePageRank\$SimplePageRankMasterCompute -wc
>>>> SimplePageRank\$SimplePageRankWorkerContext
>>>>
>>>> In particular,
>>>>
>>>> 1) I know I can use “-w” to specify the number of workers. As I understand it,
>>>> the number of workers equals the number of mappers in Hadoop, excluding
>>>> ZooKeeper. So in my case (15 slave machines), which number should I choose?
>>>> Is 15 a good choice? I find that if I pass a large number, e.g. 100, the
>>>> mappers hang.
>>>>
>>>> 2) I know I can use “-Dgiraph.numComputeThreads=1” to specify the number of
>>>> vertex compute threads. However, if I set it to 10, the total runtime is much
>>>> longer than with the default. I think the default is 1, judging from the source
>>>> code. If I want to use this parameter, which value should I choose?
>>>>
>>>> 3) When the Giraph job is running, I use the “top” command to monitor CPU
>>>> usage on the slave machines. I find that the java process uses 200%-300%
>>>> CPU. However, if I change the number of vertex compute threads to 10, the
>>>> java process uses 800% CPU. That is not a linear relation, and I want to
>>>> know why.
>>>>
>>>>
>>>> Thanks for your help.
>>>>
>>>> Best,
>>>>
>>>> -Yi
>>>>
>>>
>>>
>>>
>>> --
>>>    Claudio Martella
>>>    claudio.martella@gmail.com
>>>
>>
>>
>>
> 
> 
> 
> 
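
Claudio's sizing advice in the quoted thread (one worker per compute node, with one node kept back for master+ZooKeeper, and as many compute threads as each node can run) can be sketched as a small calculation. The class and method names below are hypothetical; only the "-w" and "-Dgiraph.numComputeThreads" options come from the thread, and where each option belongs on the command line is not covered here.

```java
public class GiraphSizing {
    /** Workers: one per node, keeping one node back for master+ZooKeeper. */
    static int workers(int nodes) {
        if (nodes < 2) {
            throw new IllegalArgumentException("need at least 2 nodes, got " + nodes);
        }
        return nodes - 1;
    }

    /** The "-w" option for the given cluster size. */
    static String workerOption(int nodes) {
        return "-w " + workers(nodes);
    }

    /** The compute-thread option: one thread per hardware thread on a node. */
    static String computeThreadsOption(int threadsPerNode) {
        if (threadsPerNode < 1) {
            throw new IllegalArgumentException("need at least 1 thread per node");
        }
        return "-Dgiraph.numComputeThreads=" + threadsPerNode;
    }

    public static void main(String[] args) {
        // Cluster from the thread: 15 nodes, 24 hardware threads each.
        System.out.println(workerOption(15));         // -w 14
        System.out.println(computeThreadsOption(24)); // -Dgiraph.numComputeThreads=24
    }
}
```

This only restates the rule of thumb from the thread; whether 24 compute threads per worker actually saturates the CPUs depends on the job.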

