giraph-dev mailing list archives

From "Eli Reisman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-274) Jobs still failing due to tasks timeout during INPUT_SUPERSTEP
Date Thu, 09 Aug 2012 16:27:19 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431955#comment-13431955
] 

Eli Reisman commented on GIRAPH-274:
------------------------------------

This whole thing is so weird and unpleasant. Please realize that whenever I babble about this
issue (now and in the past), it is to pick at seemingly useless nits to try to figure out why
one patch works and the other doesn't, because it doesn't make any sense. With that in mind:

- The versions of Hadoop and the scale we run are different, that could be it. I can't go
into detail on that count here.

- The ONLY fundamental difference between the patches, as I recall, is that in your
waitMsecs() you do not end up calling progress, but in waitForever() you do, after each timeout
period and before re-looping. Perhaps a call to progress in waitMsecs(), outside the lock's
try block, would help.
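
The suggestion in the second bullet could be sketched roughly as follows. This is not the code from either patch: `ProgressableEvent`, its fields, and the `Runnable` progress hook are hypothetical names standing in for the real event class and for something like Hadoop's `context.progress()`. The only point it illustrates is reporting progress after the lock's try/finally block, so the call happens on every `waitMsecs()` return, whether the wait timed out or was signaled.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a barrier event; not Giraph's actual class. */
public class ProgressableEvent {
    private final Lock lock = new ReentrantLock();
    private final Condition cond = lock.newCondition();
    /** Assumed progress hook, e.g. context::progress in a real job. */
    private final Runnable progress;
    private boolean eventOccurred = false;

    public ProgressableEvent(Runnable progress) {
        this.progress = progress;
    }

    /** Marks the event as having occurred and wakes all waiters. */
    public void signal() {
        lock.lock();
        try {
            eventOccurred = true;
            cond.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Waits up to msecs; returns true if the event occurred. */
    public boolean waitMsecs(long msecs) throws InterruptedException {
        boolean occurred;
        lock.lock();
        try {
            long nanosLeft = TimeUnit.MILLISECONDS.toNanos(msecs);
            // Loop to absorb spurious wakeups until signaled or timed out.
            while (!eventOccurred && nanosLeft > 0) {
                nanosLeft = cond.awaitNanos(nanosLeft);
            }
            occurred = eventOccurred;
        } finally {
            lock.unlock();
        }
        // Outside the lock's try block: report progress on every return.
        progress.run();
        return occurred;
    }

    /** Re-loops on waitMsecs(), so progress is reported once per timeout. */
    public void waitForever(long timeoutMsecs) throws InterruptedException {
        while (!waitMsecs(timeoutMsecs)) {
            // Nothing else to do: waitMsecs() already reported progress.
        }
    }
}
```

With this shape, any caller that polls `waitMsecs()` directly reports progress just as often as one that goes through `waitForever()`. Whether that is enough to keep a task alive still depends on how often a progress call actually reaches Hadoop.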

Also, I noticed that during the rebase I lowered the timeout to 30 seconds. While this seems silly
at first since the Hadoop task timeout is 10 minutes, until recently I did not know that Mapper.Context was
logging progress calls and actually only going out on the wire occasionally to report "real"
progress. This might mean that we need to call it more frequently than is intuitive in order
to get a "real" outgoing call to Hadoop during those long barrier waits? I will take a better
look at the Hadoop end of this ASAP to check the actual ratio of calls to real signals on
the wire, but I suspect this could be part of the issue. In this case, simply a shorter timeout
of 20-30 seconds seemed to solve the problem for us here, and perhaps it was because we ended
up calling "real" progress out to the wire just often enough per 10 min. window by timing
out that often.
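
To make the arithmetic behind that guess concrete, here is a small sketch. It assumes one progress call per wait timeout and a 10 minute task timeout, as described above. The class and method names are hypothetical, and the number of calls needed before one "real" signal goes out on the wire is an unknown, so the value used for it below is invented purely for illustration.

```java
/** Hypothetical helper; the calls-per-real-signal figure is NOT known. */
public class ProgressBudget {
    /** progress() calls made in one task-timeout window, one per wait timeout. */
    public static long callsPerWindow(long taskTimeoutMsecs, long waitTimeoutMsecs) {
        return taskTimeoutMsecs / waitTimeoutMsecs;
    }

    /** Largest wait timeout that still yields at least callsNeeded calls per window. */
    public static long maxWaitTimeoutMsecs(long taskTimeoutMsecs, long callsNeeded) {
        return taskTimeoutMsecs / callsNeeded;
    }

    public static void main(String[] args) {
        long taskTimeout = 10 * 60 * 1000L; // the 10 minute task timeout

        // A 30 second wait timeout gives 20 progress() calls per window.
        System.out.println(callsPerWindow(taskTimeout, 30_000L));
        // A 20 second wait timeout gives 30 progress() calls per window.
        System.out.println(callsPerWindow(taskTimeout, 20_000L));
        // If 15 calls were needed per real signal (an invented figure),
        // the wait timeout could be at most 40000 ms.
        System.out.println(maxWaitTimeoutMsecs(taskTimeout, 15));
    }
}
```

A wait timeout close to the task timeout would yield only one or two calls per window, which is consistent with the idea that the shorter timeout helped, though it does not confirm it.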

Again, this is all grasping at straws. If any of it gives you an idea, run with it please.
I'd like to see this go away; I'm surprised this issue has taken on such a life of its own.
I have a brief window here where more large scale tests are not happening, so for the next
few days I doubt I will have the chance to try to verify any of the alternatives we have all
posted here and on 246, so that sort of blocks up progress from my end as well. For all I
know one of these alternatives might really work, as of now only the old patch and the more
recent rebase (274-alt-1, aka 246-7) have been tested and work as they used to.

                
> Jobs still failing due to tasks timeout during INPUT_SUPERSTEP
> --------------------------------------------------------------
>
>                 Key: GIRAPH-274
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-274
>             Project: Giraph
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>            Reporter: Jaeho Shin
>            Assignee: Jaeho Shin
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-274-alt-1.patch, GIRAPH-274.patch
>
>
> Even after GIRAPH-267, jobs were failing during INPUT_SUPERSTEP when some workers don't
> get to reserve an input split, while others were loading vertices for a long time.  (related
> to GIRAPH-246 and GIRAPH-267)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
