ignite-user mailing list archives

From Aleksei Valikov <aleksei.vali...@gmail.com>
Subject How to retry failed job on any node with Apache Ignite
Date Mon, 29 Jun 2015 20:08:21 GMT
Hi,

this is basically a copy of

http://stackoverflow.com/questions/31124341/how-to-retry-failed-job-on-any-node-with-apache-ignite-gridgain

I'm experimenting with fault tolerance
<https://apacheignite.readme.io/v1.1/docs/fault-tolerance> in Apache Ignite.

What I can't figure out is how to retry a failed job on any node. I have a
use case where my jobs will be calling a third-party tool as a system
process via ProcessBuilder to do some calculations. In some cases the tool
may fail, but in most cases it's OK to retry the job on any node,
including the one where it previously failed.
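
The external-tool call described above could look roughly like the
following. This is a minimal sketch using only java.lang.ProcessBuilder;
the class name ToolRunner, its method names, and the convention that exit
code 75 means "temporary failure, worth retrying" are assumptions made for
illustration and do not describe the real tool.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper: runs an external tool and reports its exit code.
final class ToolRunner {
    // Assumed convention: this exit code marks a transient failure.
    static final int EXIT_TEMPORARY_FAILURE = 75;

    private ToolRunner() {
    }

    // Starts the command, waits for it to finish, and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward the tool's stdout/stderr to this JVM
                .start();
        return process.waitFor();
    }

    // Decides whether a failed run is worth another attempt.
    static boolean isRetryable(int exitCode) {
        return exitCode == EXIT_TEMPORARY_FAILURE;
    }
}
```

A job could call ToolRunner.run(...) and use isRetryable(...) on the
result to decide between retrying and failing for good.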

At the moment Ignite seems to reroute the job only to a node that has not
run this job before. So, after a while, there are no nodes left and the
task fails.

What I'm looking for is how to retry a job on any node.

Here's a test to demonstrate my problem.

Here's my randomly failing job:

public static class RandomlyFailingComputeJob implements ComputeJob {
    private static final long serialVersionUID = -8351095134107406874L;
    private final String data;

    public RandomlyFailingComputeJob(String data) {
        Validate.notNull(data);
        this.data = data;
    }

    public void cancel() {
    }

    public Object execute() throws IgniteException {
        final double random = Math.random();
        if (random > 0.5) {
            throw new IgniteException();
        } else {
            return StringUtils.reverse(data);
        }
    }
}

And below is the task:

public static class RandomlyFailingComputeTask extends
        ComputeTaskSplitAdapter<String, String> {
    private static final long serialVersionUID = 6756691331287458885L;

    @Override
    public ComputeJobResultPolicy result(ComputeJobResult res,
            List<ComputeJobResult> rcvd) throws IgniteException {
        if (res.getException() != null) {
            return ComputeJobResultPolicy.FAILOVER;
        }
        return ComputeJobResultPolicy.WAIT;
    }

    public String reduce(List<ComputeJobResult> results)
            throws IgniteException {
        final Collection<String> reducedResults = new ArrayList<String>(
                results.size());
        for (ComputeJobResult result : results) {
            reducedResults.add(result.<String> getData());
        }
        return StringUtils.join(reducedResults, ' ');
    }

    @Override
    protected Collection<? extends ComputeJob> split(int gridSize,
            String arg) throws IgniteException {
        final String[] args = StringUtils.split(arg, ' ');
        final Collection<ComputeJob> computeJobs = new ArrayList<ComputeJob>(
                args.length);
        for (String data : args) {
            computeJobs.add(new RandomlyFailingComputeJob(data));
        }
        return computeJobs;
    }
}

Test code:

    final Ignite ignite = Ignition.start();
    final String original = "The quick brown fox jumps over the lazy dog";

    final String reversed = StringUtils.join(
            ignite.compute().execute(new RandomlyFailingComputeTask(),
                    original), ' ');

As you can see, a failed job should always be failed over. Since the
probability of failure != 1, I expect the task to successfully terminate
at some point.

With a probability threshold of 0.5 and a total of 3 nodes, this rarely
happens. I'm getting an exception like class
org.apache.ignite.cluster.ClusterTopologyException: Failed to failover a
job to another node (failover SPI returned null). After some debugging I
found out that this is because I eventually run out of nodes. All of them
are gone.
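
A back-of-the-envelope estimate shows why the task so rarely finishes.
Assuming each job is attempted at most once per node (so at most 3
attempts with 3 nodes), that attempts fail independently with probability
0.5, and that the input splits into 9 jobs (one per word), the chance that
every job succeeds before it runs out of nodes is (1 - 0.5^3)^9, roughly
0.30. The class and method names below are made up for this sketch, and
the "one attempt per node" model is an assumption based on the behaviour
observed above, not a statement about Ignite internals.

```java
import java.util.Locale;

// Hypothetical estimate, not Ignite code.
final class FailoverOdds {
    private FailoverOdds() {
    }

    // Probability that all jobs succeed when each job gets one try per node.
    static double taskSuccessProbability(double failureProbability, int nodes, int jobs) {
        // A job is lost only if it fails on every node it is sent to.
        double jobExhaustsAllNodes = Math.pow(failureProbability, nodes);
        // The task succeeds only if no job is lost.
        return Math.pow(1.0 - jobExhaustsAllNodes, jobs);
    }

    public static void main(String[] args) {
        // 3 nodes, 9 words, failure probability 0.5: prints 0.3007
        System.out.println(String.format(Locale.ROOT, "%.4f",
                taskSuccessProbability(0.5, 3, 9)));
    }
}
```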

I understand that I can write my own FailoverSpi to handle this.

But this just doesn't feel right.

First, it seems like overkill.
Second, the SPI is a kind of global thing. I'd like to decide per job
whether it should be retried or failed over. This may, for instance, depend
on the exit code of the third-party tool I'm invoking. So configuring
failover through the global SPI isn't right.
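
One way to keep the decision per job, without touching the global SPI,
might be to retry inside the job itself and only let an exception escape
(and so trigger failover) once the local attempts are exhausted or the
failure is not retryable. The sketch below is plain Java with no Ignite
types; LocalRetry, callWithRetry, and the retryable predicate are
hypothetical names, and wiring this into ComputeJob.execute() is left out.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical helper: bounded, per-call retry with a caller-supplied policy.
final class LocalRetry {
    private LocalRetry() {
    }

    // Calls 'attempt' up to maxAttempts times. Stops early on success or on
    // an exception that 'retryable' rejects; rethrows the last exception.
    static <T> T callWithRetry(Callable<T> attempt,
                               Predicate<Exception> retryable,
                               int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (Exception e) {
                last = e;
                if (!retryable.test(e)) {
                    break; // permanent failure: do not retry locally
                }
            }
        }
        throw last;
    }
}
```

The predicate is where a per-job rule, such as inspecting the tool's exit
code, would go; the attempt limit keeps a persistently failing job from
looping forever on one node.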

I'd appreciate any pointers.

Many thanks and best wishes,

Alexey
