giraph-dev mailing list archives

From "Eli Reisman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-246) Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
Date Wed, 08 Aug 2012 16:53:22 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431218#comment-13431218 ]

Eli Reisman commented on GIRAPH-246:
------------------------------------

Yeah, the marriage patch is crapping out on me so far :(.

Jaeho, don't apologize; it's not your fault or your problem, this is Hadoop and Giraph not
getting along. I'm in no way claiming I can guess why this is happening or why the
timeouts still occur. Giraph's points of contact with Hadoop are pain points sometimes; there's
lots to do around interfacing better with the Hadoop infrastructure. I think you're going
to be very satisfied when you apply this sort of thinking to other parts of Giraph, and the
code will run beautifully. It's not all like this, I swear!

So 246-8 and 246-9 are probably suspect. I think 246-7 is the last rebase of the revert code
that I got to run, but I need to verify it. My goal at this point is getting the timeouts
to disappear while we open a window to solve this problem without racing the clock. I'm willing
to attempt tests on any/all solution patches people have today, so I'm taking numbers now,
speak up!

I should not have tried to figure out the predicate lock solution myself at the last minute,
but everyone wants to keep that code in and I want to stop the timeouts, and I was hoping
we could have our cake and eat it too. If the solution is to leave this alone while I patch in
the old patch and keep rebasing it for a while for the users here, that's perfectly fine with
me; I'm just glad we can start testing application code (and extending the scale out!). Why
don't you guys decide how to move forward with this and I'll work around it as I need to.
If you decide to patch in some part of this code and need me to clean it up, I'm happy to
do that too.

I wish everyone on this project had the opportunity I've had to really ramp this thing up
over the last few months on a big cluster and see what it can do. If I told you, I'd have to
kill you, but you'd die smiling. :)

As I told Jakob recently, most of the "bottlenecks" I have discovered while attempting to scale
out have been bug fixes, not overhauls. I think you would be really proud to know how close
this thing is to being a powerful bulk processing tool right now, today. It hasn't been hard
for me to evangelize this project around here; people are ready for a solution like
this. It's very exciting stuff.

In short, this sort of messy frustration with Giraph is the exception, not the norm, in my mind,
and I hope Jaeho or anyone new getting involved will recognize that. It's no accident
we are out of the incubator; this thing is no toy. Kudos to all of you.

                
> Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, GIRAPH-246-7.patch, GIRAPH-246-8.patch, GIRAPH-246-9.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to control
the time between calls to context.progress(), which allows workers to avoid timeouts during
long data load-ins in which some workers complete their input split reads much faster than others,
or finish a superstep faster. I found this allowed jobs that were large-scale but with low
memory overhead to complete even when they would previously time out during runs on a Hadoop
cluster. Timeout is still possible when a worker crashes, runs out of memory, or has other
legitimate GC or RPC trouble, but this change prevents unintentional crashes when the worker is
actually still healthy.
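
For anyone reading along, here is roughly the shape of the idea as a minimal sketch, not the
actual GIRAPH-246 patch: a daemon thread keeps calling progress() on the Hadoop context at a
configurable interval while the worker sits at a barrier, so the task tracker does not kill a
worker that is healthy but idle. The class name and constructor below are illustrative only;
per the description above, the real patch wires the interval through GiraphJob's configuration.

    import org.apache.hadoop.util.Progressable;

    /**
     * Illustrative sketch only: periodically reports progress to Hadoop so a
     * worker blocked on a long barrier wait or slow input split load is not
     * timed out by the task tracker.
     */
    public class PeriodicProgressThread extends Thread {
      private final Progressable context;
      private final long intervalMsec;
      private volatile boolean running = true;

      public PeriodicProgressThread(Progressable context, long intervalMsec) {
        this.context = context;
        this.intervalMsec = intervalMsec;
        setDaemon(true);
      }

      @Override
      public void run() {
        while (running) {
          // Tell Hadoop the worker is still alive even though it is waiting.
          context.progress();
          try {
            Thread.sleep(intervalMsec);
          } catch (InterruptedException e) {
            return; // stop quietly when the worker interrupts us
          }
        }
      }

      /** Stop reporting progress once the barrier wait is over. */
      public void finish() {
        running = false;
        interrupt();
      }
    }

In use, the worker would start one of these before entering a barrier wait and call finish()
once it proceeds to the next superstep; genuine failures (OOM, GC stalls, RPC trouble) still
surface because the thread dies with the worker.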

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
