giraph-dev mailing list archives

From "Eli Reisman (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (GIRAPH-246) Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
Date Thu, 16 Aug 2012 01:02:38 GMT

     [ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-246:
-------------------------------

    Attachment: GIRAPH-246-NEW-FIX.patch

This is working for us. My ops fella still says we need to call progress() more often, but I
have tested this several times now and it consistently gets past our timeout filters. To make sure
I wasn't crazy, I ran trunk again after testing this, and it timed out again at 10 min. like
always. The only major change here is:

PredicateLock#waitMsecs() never called progress() during waits; only waitForever() did, by timing
out every MSEC_PERIOD. So waits under waitMsecs() (such as in INPUT_SUPERSTEP, which was often the
failure point for us) never reported progress for idle workers at the barrier.

This patch has waitMsecs() call condition.await() in fixed slices of the requested timeout,
calling progress() after each slice and continuing the wait if the total time is not yet used up.
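
For reference, here is a minimal, self-contained sketch of that sliced-wait pattern. The names
here (the Progressable stand-in, eventOccurred, the MSEC_PERIOD value) are illustrative
assumptions, not the actual PredicateLock internals from the patch:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the sliced-wait pattern; names are illustrative, not the committed code.
class SlicedWaitSketch {
  // Hypothetical stand-in for Hadoop's Progressable / Mapper.Context.
  interface Progressable {
    void progress();
  }

  // Slice length between progress() calls (illustrative value).
  private static final long MSEC_PERIOD = 10000;
  private final Lock lock = new ReentrantLock();
  private final Condition cond = lock.newCondition();
  private boolean eventOccurred = false;
  private final Progressable context;

  SlicedWaitSketch(Progressable context) {
    this.context = context;
  }

  // Wait up to msecs for the event, waking every MSEC_PERIOD to call
  // context.progress() so idle workers are not killed by the task timeout.
  public boolean waitMsecs(long msecs) {
    long deadline = System.currentTimeMillis() + msecs;
    lock.lock();
    try {
      while (!eventOccurred) {
        long remaining = deadline - System.currentTimeMillis();
        if (remaining <= 0) {
          return false; // timeout used up, event never arrived
        }
        // Wait one slice at a time, then report progress and loop.
        cond.await(Math.min(remaining, MSEC_PERIOD), TimeUnit.MILLISECONDS);
        context.progress(); // called while holding the lock, as in the patch
      }
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve interrupt status
      throw new IllegalStateException("waitMsecs: interrupted", e);
    } finally {
      lock.unlock();
    }
  }

  // Signal the event and wake all waiters.
  public void signal() {
    lock.lock();
    try {
      eventOccurred = true;
      cond.signalAll();
    } finally {
      lock.unlock();
    }
  }
}

The point is that condition.await() is bounded by MSEC_PERIOD, so an idle worker wakes up often
enough to report progress even while the barrier predicate stays false.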

This "fix" makes the call from within the lock state which I don't like, but it works, perhaps
because:

1. no one else is calling progress() during these barrier waits or there would be no problem
here to start with, so no problems with thread contention during the calls.

2. calls like progress() are idempotent in that nothing rides on the calls conflicting, as
long as they occur at all.

3. The locking is to protect against asynchronous event state changes, and that is not idempotent,
so we shouldn't break the try block when the timed condition.await() stops to call progress().

If we want to break the try block to call progress at each loop we can, but this is already
working and keeps the event part safe inside the locked code at all times, making minimal
change to an important part of the code that already works correctly.
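
For comparison, here is what that alternative might look like as a drop-in method for the sketch
above, releasing the lock around each progress() call; again an illustrative sketch, not the
committed code:

  // Alternative (not what the patch does): release the lock around each
  // progress() call. Uses the same illustrative fields as SlicedWaitSketch.
  public boolean waitMsecsUnlockedProgress(long msecs) {
    long deadline = System.currentTimeMillis() + msecs;
    lock.lock();
    try {
      while (!eventOccurred) {
        long remaining = deadline - System.currentTimeMillis();
        if (remaining <= 0) {
          return false;
        }
        cond.await(Math.min(remaining, MSEC_PERIOD), TimeUnit.MILLISECONDS);
        lock.unlock();
        try {
          context.progress(); // progress reported outside the lock
        } finally {
          lock.lock(); // re-acquire before re-checking eventOccurred
        }
      }
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted", e);
    } finally {
      lock.unlock();
    }
  }

The tradeoff is exactly the one described above: the event state can change while the lock is
released, so eventOccurred must be re-checked after re-acquiring, which is more moving parts for
no observed benefit.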

I was not able to apply 291 as it is stale, so I wrote a simple test; I figured when 291 is
ready, it can replace this code with more comprehensive tests if you like.

If someone else could try this out, it would make me happy. But I think it's good to go; it
passes mvn verify and its new test, etc.

> Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch, GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, GIRAPH-246-7.patch, GIRAPH-246-7_rebase1.patch, GIRAPH-246-8.patch, GIRAPH-246-9.patch, GIRAPH-246-NEW-FIX.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to control the time
> between calls to context().progress(), which allows workers to avoid timeouts during long data
> load-ins in which some workers complete their input split reads much faster than others, or
> finish a superstep faster. I found this allowed jobs that were large-scale but with low memory
> overhead to complete even when they would previously time out during runs on a Hadoop cluster.
> Timeout is still possible when the worker crashes, runs out of memory, or has other legitimate
> GC or RPC trouble, but this change prevents unintentional timeouts when the worker is actually
> still healthy.
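
As a footnote to the description above, a minimal sketch of how such a configurable period might
be read from the Hadoop Configuration; the key name giraph.progressPeriodMsecs and the default
below are assumptions for illustration, not necessarily the option the patch adds:

import org.apache.hadoop.conf.Configuration;

public class ProgressPeriodSketch {
  // Hypothetical config key controlling msecs between progress() calls;
  // the real option name in the patch may differ.
  public static final String PROGRESS_PERIOD_KEY = "giraph.progressPeriodMsecs";
  public static final long DEFAULT_PROGRESS_PERIOD = 10000;

  public static long getProgressPeriod(Configuration conf) {
    return conf.getLong(PROGRESS_PERIOD_KEY, DEFAULT_PROGRESS_PERIOD);
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Equivalent to passing -D giraph.progressPeriodMsecs=30000 on the command line.
    conf.setLong(PROGRESS_PERIOD_KEY, 30000);
    System.out.println("Progress period: " + getProgressPeriod(conf) + " msecs");
  }
}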

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
