giraph-dev mailing list archives

From "Eli Reisman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-246) Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
Date Fri, 17 Aug 2012 22:25:38 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437114#comment-13437114 ]

Eli Reisman commented on GIRAPH-246:
------------------------------------

I think I have tracked the issues down to something internal that was going on while I was
doing the runs earlier. Repeated re-tests of our rebase, which has worked for a month, and of
246-NEW-FIX-2, which keeps all the predicate lock code from Jaeho's fix as described earlier,
both work as normal again. Given all the weirdness on the cluster the last few days, let's wait
on doing anything with these until Monday. I will run many more jobs over the weekend
to make absolutely certain that whatever was going on is internal and not a problem with the
patches. I should have known when our rebase failed today that something odder than Giraph
problems was happening. But let me make absolutely certain today and this weekend before going
any further. Sorry about all the comments here. I was under the impression this might get
committed today after the earlier testing success, and when I got here today and neither patch
was behaving properly any more, I wanted to make sure that didn't happen until I had tracked
the problem down.

Jaeho, you had mentioned wanting to try the extra thread option. Since I'll be running these
two patches a lot this weekend, feel free to put some code for that up here or under your
other post on this issue, and I'd be happy to try it out for you.

                
> Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters
during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch, GIRAPH-246-1.patch, GIRAPH-246-2.patch,
GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, GIRAPH-246-7.patch,
GIRAPH-246-7_rebase1.patch, GIRAPH-246-7_rebase2.patch, GIRAPH-246-8.patch, GIRAPH-246-9.patch,
GIRAPH-246-NEW-FIX-2.patch, GIRAPH-246-NEW-FIX.patch
>
>
> This simple change adds a command-line configurable option to GiraphJob that controls
the time between calls to context.progress(). This lets workers avoid timeouts during
long data load-ins in which some workers finish reading their input splits much faster than
others, or finish a superstep faster. I found this allowed jobs that were large-scale but had
low memory overhead to complete even when they would previously time out during runs on a
Hadoop cluster. A timeout is still possible when a worker crashes, runs out of memory, or has
other legitimate GC or RPC trouble, but the change prevents unintentional failures when the
worker is actually still healthy.
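
For readers following the thread, the idea described above can be sketched roughly as below. This is only an illustration, not the code from any of the attached patches: ProgressReporter is a hypothetical stand-in for the Hadoop task context's progress() call, a CountDownLatch stands in for the barrier, and awaitWithProgress and intervalMs are made-up names for the wait loop and the configurable interval.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the Hadoop context's progress() callback. */
interface ProgressReporter {
    void progress();
}

final class ProgressingBarrierWait {
    private ProgressingBarrierWait() {}

    /**
     * Waits for the barrier to open, calling reporter.progress() every time
     * intervalMs elapses without the barrier being released.
     * Returns the number of progress calls made. If the waiting thread is
     * interrupted, the interrupt flag is restored and the wait ends early.
     */
    static int awaitWithProgress(CountDownLatch barrier,
                                 ProgressReporter reporter,
                                 long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        int calls = 0;
        try {
            // await(...) returns false on timeout, true once the latch reaches zero.
            while (!barrier.await(intervalMs, TimeUnit.MILLISECONDS)) {
                reporter.progress(); // tell the framework this worker is still alive
                calls++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return calls;
    }
}
```

The interval would need to be comfortably shorter than the cluster's task timeout. A worker that has really crashed or hung stops running this loop, so the normal timeout still applies to it.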

