giraph-dev mailing list archives

From "Maja Kabiljo" <majakabi...@fb.com>
Subject Re: Review Request: Fix log messages produced by aggregators
Date Wed, 20 Mar 2013 03:29:09 GMT


> On March 19, 2013, 10:08 p.m., Alessandro Presta wrote:
> > Thanks for taking the time to revise this code.
> > Can you please post a sample log?
> > The main thing I'm interested in is telling the user what we're waiting on (we're
waiting on certain machines to send a certain number of aggregator-related requests).
> > Another issue is that this mechanism obscures which machines are actually ready
and which ones are still working. It would be good to have each worker print which workers
it's waiting on, when they fall under a certain number (like we do with other requests), since
we can't see that in the master status anymore.
> 
> Maja Kabiljo wrote:
>     I understood that, it's what this patch is providing :-)
>     Here is the log example:
>     
>     2013-03-19 14:46:09,795 [main] org.apache.giraph.utils.TaskIdsPermitsBarrier  - waitForRequiredPermits:
Waiting for 10 more tasks to send their data, task ids: [1, 4, 5, 9, 11, 13, 16, 25, 26, 28]
>     
>     I set the limit for logging the exact ids to 10.
>     No, in the complicated case when we have a lot of aggregator requests, if each
worker has sent at least something, we still won't know which workers we are waiting for. But
in practice I never saw this happen for longer than a few seconds, so I don't think it's
a big deal. If some workers are busy doing other work, we won't receive any
aggregator requests from them, and will see this kind of log.
> 
> Alessandro Presta wrote:
>     Sorry but it was difficult to see exactly what changed, since it shows it as a completely
new file...
>     The log message looks better now. Is there a way we can wrap it into something that
refers to aggregators?
>     As I understand it, now we log the missing task ids if we haven't received anything
from them. I guess that's ok for most applications (where there are no or just a few aggregators),
but seems problematic in case the application actually uses aggregators heavily. If it's a
simple change, I would make it so that we track workers that either have sent no information
at all, or only part of it. Anyway, this already solves my main issue with troubleshooting
jobs.
> 
> Maja Kabiljo wrote:
>     Yeah, I don't know why the diff tool didn't detect that the file was just renamed.
>     I could change the log line to say that we are waiting for aggregators, but I don't
think it's a good idea since it's misleading - as you know, in all the cases when we see lines
about waiting for permits, it's nothing related to aggregators; it's like that just because
aggregators are the last step in the superstep. I can already see the user mailing list full of questions
like "I'm not using aggregators but my application waits on aggregators..." :-) What do others
think?
>     I'm not sure how complicated it would be to make this print nice logs when we received
something but not everything from a worker; if we find a need for it in the future I can work
on it. Aggregator requests are by default limited to 1MB, so if your aggregator data is less
than numOfWorkers * 1MB there won't be multiple requests. And, as I said, even in our application
which heavily uses aggregators, we still very rarely end up in this situation, and it lasts
a very short time.
> 
> Alessandro Presta wrote:
>     If it's not related to aggregators, then what is it related to? I think the problem
is that we have these log lines even when there are 0 aggregators (which, as you say, can
be confusing).
>     "Permits" is no less misleading than "aggregators", except the user probably doesn't
know what the former means. It would be better if we could skip this whole permits mechanism
when there are no aggregators.
>     I'm ok with the current logic for logging task ids, but I still think the log message
should be more transparent.
>     Anyway up to you. Maybe someone else can comment on this...

I don't want to make a big deal out of a single log message, so I changed it and added one
log line when the waiting-for-aggregators part starts.

All I'm saying is that whenever we saw this message in practice, it wasn't because the aggregators
part was slow, but because some of the workers were slow doing the computation / processing
message requests. It's not just about the cases when there are 0 aggregators: in the majority
of situations you'll have just one or a few aggregator requests, and these always get transferred
instantly (compare that to the thousands of message requests which we send in each of our
applications).

We can't just skip this part (unless we add a special option which the user needs to turn
on in order to use aggregators - but I don't think we want that), since workers need
to know somehow whether there are aggregators in the application or not. So we'd need to have
ZooKeeper involved in one code path and the current solution in the other. And, as discussed many
times before, I don't see the benefit of doing so, since there is basically no overhead to the
current approach.
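For anyone following along, the waiting/logging behavior discussed above can be sketched roughly like this. This is a minimal illustration, not the actual Giraph TaskIdsPermitsBarrier; the class name PermitsBarrierSketch, its methods, and the MAX_TASK_IDS_TO_LOG threshold of 10 are assumptions based on the log example earlier in this thread:

```java
import java.util.Set;
import java.util.TreeSet;

/**
 * Minimal sketch of a barrier that tracks which task ids still owe data
 * and only logs the exact ids once few enough tasks remain, loosely
 * modeled on the behavior described in this thread.
 */
public class PermitsBarrierSketch {
    /** Log exact ids only when this few (or fewer) tasks remain. */
    private static final int MAX_TASK_IDS_TO_LOG = 10;

    /** Task ids we have not yet received all data from (sorted for readable logs). */
    private final Set<Integer> waitingOn = new TreeSet<>();

    public PermitsBarrierSketch(Set<Integer> expectedTaskIds) {
        waitingOn.addAll(expectedTaskIds);
    }

    /** Called when a task has sent everything it owes. */
    public synchronized void taskDone(int taskId) {
        waitingOn.remove(taskId);
    }

    /** Builds the status line a worker would log while waiting. */
    public synchronized String statusMessage() {
        if (waitingOn.isEmpty()) {
            return "waitForRequiredPermits: Done";
        }
        String msg = "waitForRequiredPermits: Waiting for " + waitingOn.size()
            + " more tasks to send their data";
        // Only list the ids when the set is small enough to be readable.
        if (waitingOn.size() <= MAX_TASK_IDS_TO_LOG) {
            msg += ", task ids: " + waitingOn;
        }
        return msg;
    }
}
```

Each taskDone call corresponds to a worker having sent all of its data; as discussed, a task that has sent only part of its data would not show up as missing here, which is the limitation mentioned above.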


- Maja


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10029/#review18105
-----------------------------------------------------------


On March 20, 2013, 3:29 a.m., Maja Kabiljo wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10029/
> -----------------------------------------------------------
> 
> (Updated March 20, 2013, 3:29 a.m.)
> 
> 
> Review request for giraph and Alessandro Presta.
> 
> 
> Description
> -------
> 
> Now we'll be writing which workers we are waiting for.
> 
> 
> This addresses bug GIRAPH-537.
>     https://issues.apache.org/jira/browse/GIRAPH-537
> 
> 
> Diffs
> -----
> 
>   giraph-core/src/main/java/org/apache/giraph/comm/aggregators/AllAggregatorServerData.java
dddd1cb 
>   giraph-core/src/main/java/org/apache/giraph/comm/aggregators/OwnerAggregatorServerData.java
70ff7fe 
>   giraph-core/src/main/java/org/apache/giraph/comm/netty/NettyMasterClient.java 86ea8dc

>   giraph-core/src/main/java/org/apache/giraph/comm/netty/NettyWorkerAggregatorRequestProcessor.java
cd24219 
>   giraph-core/src/main/java/org/apache/giraph/comm/requests/ByteArrayWithSenderTaskIdRequest.java
PRE-CREATION 
>   giraph-core/src/main/java/org/apache/giraph/comm/requests/SendAggregatorsToOwnerRequest.java
21b1b2d 
>   giraph-core/src/main/java/org/apache/giraph/comm/requests/SendAggregatorsToWorkerRequest.java
7e84e17 
>   giraph-core/src/main/java/org/apache/giraph/comm/requests/SendWorkerAggregatorsRequest.java
264f03a 
>   giraph-core/src/main/java/org/apache/giraph/utils/ExpectedBarrier.java ccd137c 
>   giraph-core/src/main/java/org/apache/giraph/utils/TaskIdsPermitsBarrier.java PRE-CREATION

>   giraph-core/src/main/java/org/apache/giraph/worker/WorkerAggregatorHandler.java 001cf59

> 
> Diff: https://reviews.apache.org/r/10029/diff/
> 
> 
> Testing
> -------
> 
> mvn clean install
> AggregatorsBenchmark
> 
> 
> Thanks,
> 
> Maja Kabiljo
> 
>

