giraph-user mailing list archives

From Avery Ching <ach...@apache.org>
Subject Re: What a "worker" really is and other interesting runtime information
Date Wed, 28 Nov 2012 23:20:56 GMT
Oh, forgot one thing.  You need to set the number of partitions as
well, since each thread works on a single partition at a time.

Try -Dhash.userPartitionCount=<number of threads>
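
For example, with 12 compute threads the SSSP command from your mail
would look something like this (just a sketch: your exact command with
X=12 and the partition count option added):

hadoop jar target/giraph-0.1-jar-with-dependencies.jar \
    org.apache.giraph.examples.SimpleShortestPathsVertex \
    -Dgiraph.SplitMasterWorker=false \
    -Dgiraph.numComputeThreads=12 \
    -Dhash.userPartitionCount=12 \
    input output 12 1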

On 11/28/12 5:29 AM, Alexandros Daglis wrote:
> Dear Avery,
>
> I followed your advice, but the application seems to be completely 
> insensitive to the thread count: I observe literally zero performance 
> scaling as I increase the number of threads. Maybe you can point 
> out whether I am doing something wrong.
>
> - Using only 4 cores on a single node at the moment
> - Input graph: 14 million vertices, file size is 470 MB
> - Running SSSP as follows: hadoop jar 
> target/giraph-0.1-jar-with-dependencies.jar 
> org.apache.giraph.examples.SimpleShortestPathsVertex 
> -Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=X input 
> output 12 1
> where X=1,2,3,12,30
> - I notice total insensitivity to the number of threads I specify. 
> Aggregate core utilization is always approximately the same (usually 
> around 25-30% => only one of the cores running) and overall execution 
> time is always the same (~8 mins)
>
> Why is Giraph's performance not scaling? Is the input size or the 
> number of workers inappropriate? It's not an I/O issue either: even 
> during periods of really low core utilization, the time is spent 
> idle, not on I/O.
>
> Cheers,
> Alexandros
>
>
>
> On 28 November 2012 11:13, Alexandros Daglis 
> <alexandros.daglis@epfl.ch> wrote:
>
>     Thank you Avery, that helped a lot!
>
>     Regards,
>     Alexandros
>
>
>     On 27 November 2012 20:57, Avery Ching <aching@apache.org> wrote:
>
>         Hi Alexandros,
>
>         The extra task is for the master process (a coordination
>         task). In your case, since you are using a single machine, you
>         can use a single task.
>
>         -Dgiraph.SplitMasterWorker=false
>
>         and you can try multithreading instead of multiple workers.
>
>         -Dgiraph.numComputeThreads=12
>
>         The reason CPU usage increases is the Netty threads that
>         handle network requests.  By using multithreading instead,
>         you should bypass this.
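>
>         If it is easier, you can also set those options in code before
>         submitting the job.  Here is a minimal sketch using the plain
>         Hadoop Configuration API (the class name and the value 12 are
>         just examples; pass the conf to your job as usual):
>
>         import org.apache.hadoop.conf.Configuration;
>
>         public class SingleNodeGiraphConf {
>             public static Configuration create() {
>                 Configuration conf = new Configuration();
>                 // Run the master in the same task as the worker
>                 // (fine on a single machine).
>                 conf.setBoolean("giraph.SplitMasterWorker", false);
>                 // Use 12 compute threads inside the single worker.
>                 conf.setInt("giraph.numComputeThreads", 12);
>                 return conf;
>             }
>         }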
>
>         Avery
>
>
>         On 11/27/12 9:40 AM, Alexandros Daglis wrote:
>
>             Hello everybody,
>
>             I went through most of the documentation I could find for
>             Giraph and also most of the messages on this mailing list,
>             but I still have not figured out precisely what a "worker"
>             really is. I would really appreciate it if you could help
>             me understand how the framework works.
>
>             At first I thought that a worker has a one-to-one
>             correspondence to a map task. Apparently this is not
>             exactly the case, since I have noticed that if I ask for x
>             workers, the job finishes after having used x+1 map tasks.
>             What is this extra task for?
>
>             I have been trying out the example SSSP application on a
>             single node with 12 cores. With an input graph of ~400 MB
>             and 1 worker, around 10 GB of memory is used during
>             execution. What intrigues me is that if I use 2 workers
>             for the same input (and without limiting memory per map
>             task), twice the memory is used. Furthermore, there is no
>             improvement in performance; if anything, I notice a
>             slowdown. Are these observations normal?
>
>             Might it be that 1 or 2 workers are too few, and I should
>             go to the 30-100 range that is the recommended number of
>             mappers for a conventional MapReduce job?
>
>             Finally, a last observation. Even though I use only 1
>             worker, I see significant periods during execution where
>             up to 90% of the 12 cores' computing power is consumed,
>             that is, almost 10 cores are busy in parallel. Does each
>             worker spawn multiple threads and dynamically balance the
>             load to utilize the available hardware?
>
>             Thanks a lot in advance!
>
>             Best,
>             Alexandros