ignite-user mailing list archives

From Valentin Kulichenko <valentin.kuliche...@gmail.com>
Subject Re: How to retry failed job on any node with Apache Ignite
Date Mon, 29 Jun 2015 21:16:06 GMT
Alexey,

I duplicated the response on Stack Overflow as well.

Thanks!

On Mon, Jun 29, 2015 at 1:31 PM, Aleksei Valikov <aleksei.valikov@gmail.com>
wrote:

> Hi,
>
> thanks for the quick response. So it seems I didn't miss anything obvious.
>
> Thank you!
>
> Best wishes,
> Alexey
>
> On Mon, Jun 29, 2015 at 10:28 PM, Valentin Kulichenko <
> valentin.kulichenko@gmail.com> wrote:
>
>> Alexey,
>>
>> I see your point, and it really looks like your use case should be an
>> option of AlwaysFailoverSpi (which is the default one). Currently, though,
>> it doesn't fail over once it has already tried all nodes for a particular
>> job. So you will have to implement your own failover SPI (it should be
>> pretty simple: just pick a random node from the topology each time a job
>> is trying to fail over).
>>
>> As for the global nature of the SPI, you're right, but its failover()
>> method takes a FailoverContext, which has information about the failed job
>> (task name, attributes, exception, etc.), so you can make a decision based
>> on this information.
>>
>> Hope this helps.
>>
>> Thanks!
>>
>> On Mon, Jun 29, 2015 at 1:08 PM, Aleksei Valikov <
>> aleksei.valikov@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> this is basically a copy of
>>>
>>>
>>> http://stackoverflow.com/questions/31124341/how-to-retry-failed-job-on-any-node-with-apache-ignite-gridgain
>>>
>>> I'm experimenting with fault tolerance
>>> <https://apacheignite.readme.io/v1.1/docs/fault-tolerance> in Apache
>>> Ignite.
>>>
>>> What I can't figure out is how to retry a failed job on any node. I have
>>> a use case where my jobs will be calling a third-party tool as a system
>>> process via ProcessBuilder to do some calculations. In some cases the tool
>>> may fail, but in most cases it's OK to retry the job on any node,
>>> including the one where it previously failed.
>>>
>>> At the moment Ignite seems to reroute the job to another node which did
>>> not have this job before. So, after a while all nodes are gone and the task
>>> fails.
>>>
>>> What I'm looking for is how to retry a job on any node.
>>>
>>> Here's a test to demonstrate my problem.
>>>
>>> Here's my randomly failing job:
>>>
>>> public static class RandomlyFailingComputeJob implements ComputeJob {
>>>     private static final long serialVersionUID = -8351095134107406874L;
>>>     private final String data;
>>>
>>>     public RandomlyFailingComputeJob(String data) {
>>>         Validate.notNull(data);
>>>         this.data = data;
>>>     }
>>>
>>>     public void cancel() {
>>>     }
>>>
>>>     public Object execute() throws IgniteException {
>>>         final double random = Math.random();
>>>         if (random > 0.5) {
>>>             throw new IgniteException();
>>>         } else {
>>>             return StringUtils.reverse(data);
>>>         }
>>>     }
>>> }
>>>
>>> And below is the task:
>>>
>>> public static class RandomlyFailingComputeTask extends
>>>         ComputeTaskSplitAdapter<String, String> {
>>>     private static final long serialVersionUID = 6756691331287458885L;
>>>
>>>     @Override
>>>     public ComputeJobResultPolicy result(ComputeJobResult res,
>>>             List<ComputeJobResult> rcvd) throws IgniteException {
>>>         if (res.getException() != null) {
>>>             return ComputeJobResultPolicy.FAILOVER;
>>>         }
>>>         return ComputeJobResultPolicy.WAIT;
>>>     }
>>>
>>>     public String reduce(List<ComputeJobResult> results)
>>>             throws IgniteException {
>>>         final Collection<String> reducedResults = new ArrayList<String>(
>>>                 results.size());
>>>         for (ComputeJobResult result : results) {
>>>             reducedResults.add(result.<String> getData());
>>>         }
>>>         return StringUtils.join(reducedResults, ' ');
>>>     }
>>>
>>>     @Override
>>>     protected Collection<? extends ComputeJob> split(int gridSize,
>>>             String arg) throws IgniteException {
>>>         final String[] args = StringUtils.split(arg, ' ');
>>>         final Collection<ComputeJob> computeJobs = new ArrayList<ComputeJob>(
>>>                 args.length);
>>>         for (String data : args) {
>>>             computeJobs.add(new RandomlyFailingComputeJob(data));
>>>         }
>>>         return computeJobs;
>>>     }
>>> }
>>>
>>> Test code:
>>>
>>>     final Ignite ignite = Ignition.start();
>>>     final String original = "The quick brown fox jumps over the lazy dog";
>>>
>>>     // The task's reduce() already joins the job results into one String.
>>>     final String reversed = ignite.compute().execute(
>>>             new RandomlyFailingComputeTask(), original);
>>>
>>> As you can see, a failed job should always be failed over. Since the
>>> probability of failure != 1, I expect the task to successfully terminate
>>> at some point.
>>>
>>> With the probability threshold of 0.5 and a total of 3 nodes this hardly
>>> ever happens. I'm getting an exception like class
>>> org.apache.ignite.cluster.ClusterTopologyException: Failed to failover a
>>> job to another node (failover SPI returned null). After some debugging
>>> I've found out that this is because I eventually run out of nodes. All of
>>> them are gone.
>>>
>>> I understand that I can write my own FailoverSpi to handle this.
>>>
>>> But this just doesn't feel right.
>>>
>>> First, it seems to be overkill to do this.
>>> Second, the SPI is a kind of global thing. I'd like to decide per job
>>> whether it should be retried or failed over. This may, for instance,
>>> depend on the exit code of the third-party tool I'm invoking. So
>>> configuring failover via the global SPI isn't right.
>>>
>>> I'd appreciate any pointers.
>>>
>>> Many thanks and best wishes,
>>>
>>> Alexey
>>>
>>
>>
>
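
The reply above suggests a custom failover SPI that picks a random node from the topology on every failover and uses the failed job's information to decide whether to retry at all. Below is a minimal sketch of that decision logic only, in plain Java so that it stands on its own. It does not use the Ignite API: the class RandomNodeFailover, the FailedJob holder, the exit-code policy, and the attempt limit are all hypothetical stand-ins, and nodes are represented by plain strings. In a real SPI this logic would presumably sit inside failover(), reading the same kind of data from the FailoverContext. Returning null to give up mirrors the "failover SPI returned null" message quoted in the question.

```java
import java.util.List;
import java.util.Random;

/** Sketch of "retry on any node" failover logic; not tied to the Ignite API. */
final class RandomNodeFailover {

    /** Hypothetical summary of a failed job (stand-in for failover context data). */
    static final class FailedJob {
        final String taskName;
        final int exitCode;  // hypothetical: exit code of the external tool
        final int attempts;  // how many times this job has already failed over

        public FailedJob(String taskName, int exitCode, int attempts) {
            this.taskName = taskName;
            this.exitCode = exitCode;
            this.attempts = attempts;
        }
    }

    private final Random random;
    private final int maxAttempts;

    RandomNodeFailover(Random random, int maxAttempts) {
        this.random = random;
        this.maxAttempts = maxAttempts;
    }

    /**
     * Returns the node to retry the job on, or null to give up.
     * Any node may be chosen, including one where the job failed before.
     */
    String failover(FailedJob job, List<String> topology) {
        if (topology.isEmpty() || job.attempts >= maxAttempts || !isRetryable(job)) {
            return null;
        }
        return topology.get(random.nextInt(topology.size()));
    }

    /** Hypothetical per-job policy: treat exit code 2 as a permanent error. */
    static boolean isRetryable(FailedJob job) {
        return job.exitCode != 2;
    }
}
```

The attempt limit is a design choice of this sketch rather than something from the thread: without some bound, a job that always fails would be retried forever once "all nodes tried" no longer stops it.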
