From: Zoran Regvart
Date: Sun, 20 Aug 2017 00:52:25 +0200
Subject: Re: Race Condition in Aggregation using HazelcastAggregationRepository in a cluster
To: users@camel.apache.org

Hi Michael,

it's a bit hard to follow, so I could be misunderstanding your issue: is it
that there is a race condition between the aggregator that expects the reply
on node A and another aggregator on node B that is not aware of the initial
request?

If you're doing only request-reply correlation, perhaps take a look at the
InOut message exchange pattern with a correlation property[1], with the
replying application setting the ReplyToQMgr to the requester's queue
manager. Or, place the reply in a Hazelcast queue regardless of the queue
manager the reply landed on, and process the reply from there.

Also, I think it would be better to set up the reply coordination expectation
(with timeouts and without transactions -- those would block) before sending
the message.

2c

[1] https://camel.apache.org/correlation-identifier.html
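
For illustration, a minimal sketch of the InOut request-reply idea, assuming
plain camel-jms request-reply correlation (matching on JMSCorrelationID) is
enough for this case. The "wmq" component name, the queue names and the
timeout are placeholders, not taken from the setup described below:

    import org.apache.camel.ExchangePattern;
    import org.apache.camel.builder.RouteBuilder;

    public class RequestReplySketch extends RouteBuilder {
        @Override
        public void configure() {
            from("direct:sendRequest")
                // With InOut the JMS producer itself waits for the correlated
                // reply, so no separate reply-consuming route and no shared
                // aggregation state are needed. Component name, queue names
                // and timeout here are illustrative placeholders.
                .to(ExchangePattern.InOut,
                    "wmq:queue:REQUEST.QUEUE?replyTo=RESPONSE.QUEUE&requestTimeout=30000")
                .log("got reply: ${body}");
        }
    }

With a reply-to queue reachable from the sending node this keeps request and
reply together, at the cost of the sending route waiting for each outstanding
reply, which may or may not be acceptable here.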
On Wed, Aug 16, 2017 at 5:10 PM, Michael Lück wrote:
> Hi there,
>
> we just hit a problem in one of our systems, and it looks like there is an
> issue with locking in the AggregateProcessor in a distributed environment.
>
> I'll try to explain it:
>
> ==================================================
> Scenario
> ==================================================
>
> We use camel-core and camel-hazelcast 2.16.5 and hazelcast 3.5.2.
>
> We have a route which sends a message to a WebSphere MQ queue (via the JMS
> component), and after that we put the message into an aggregator which uses
> the JMSCorrelationID to correlate the request and the response:
>
> from(epAggregation)
>     .aggregate(header("JMSCorrelationID"), new CustomAggregationStrategy())
>         .completionTimeout(Integer.parseInt(
>             getContext().resolvePropertyPlaceholders("{{timeout}}")))
>         .completionSize(2)
>         .aggregationRepository(aggrRepo)
>
> The aggregation repository aggrRepo is created like this:
>
> HazelcastAggregationRepository aggrRepo =
>     new HazelcastAggregationRepository("aggrRepoDrsBatch", hcInst);
>
> where hcInst is an instance of com.hazelcast.core.HazelcastInstance.
>
> We also have another route which reads the response from the response queue
> and forwards it to the aggregator.
>
> The environment consists of two nodes on which the same code is running (so
> essentially the send and response routes and the aggregation).
>
> The problem arises when the response is returned very fast and is consumed
> on the node that didn't send the request.
>
> ==================================================
> Analysis
> ==================================================
>
> I dug around a bit in the Camel code, and it seems to me that the problem is
> the lock in the AggregateProcessor, as it is local to the JVM in which the
> code runs. I'll walk through an example to make this clearer:
>
> - Node A sends an MQ message and after that puts the message into the
>   aggregator. The AggregateProcessor runs and checks the lock before going
>   into doAggregation().
> - In doAggregation() it tries to get the Exchange from the repository and
>   doesn't find any, so it continues to aggregate the first message and
>   writes it into the repository.
> - At about the same time, between reading the exchange from the repository
>   and writing the "aggregated" first message back, Node B fetches the reply
>   from the response queue and sends it to the aggregator. As on node A the
>   lock is checked, and because the code runs in another JVM the lock is
>   free, so the AggregateProcessor can go into doAggregation().
> - In doAggregation() node B tries to get the Exchange from the repository
>   before the other node has written it, and like node A it proceeds with
>   creating the first Exchange for the aggregation and writes it into the
>   repository.
>
> The result is that one of the nodes overwrites the Exchange the other one
> created before, and the aggregation never completes properly (actually it
> does complete, but only because of the timeout).
>
> ==================================================
> Ideas to solve the problem
> ==================================================
>
> - Probably optimistic locking is an option here, as
>   HazelcastAggregationRepository supports this by implementing
>   OptimisticLockingAggregationRepository.
>   => I'd like to hear your thoughts on this (a sketch of this option is
>   appended after this message).
> - Currently we can stop the route consuming from the response queue on one
>   node to eliminate the error, but this is not a long-term option because
>   we lose the ability to fail over.
> - Perhaps the AggregateProcessor could get the lock object from the
>   repository. For example, the HazelcastAggregationRepository could return
>   the lock object for the Hazelcast map, which would lock it for the whole
>   cluster.
> - I thought about resending the MQ message in case of a timeout, but as the
>   request has side effects on the system that processes the message, this
>   is not really an option.
>
> I hope I have made myself clear. If you have any questions that would help
> you to help me, I'd be happy to answer them.
>
> Regards,
> Michael

-- 
Zoran Regvart
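
A minimal sketch of the optimistic-locking idea from the first bullet under
"Ideas to solve the problem" above, assuming camel-hazelcast 2.16.x where
HazelcastAggregationRepository implements OptimisticLockingAggregationRepository.
The endpoint URIs, the timeout value and UseLatestAggregationStrategy
(standing in for the CustomAggregationStrategy from the original mail) are
placeholders; whether this alone removes the race in 2.16.5 is not verified
here:

    import org.apache.camel.builder.RouteBuilder;
    import org.apache.camel.processor.aggregate.UseLatestAggregationStrategy;
    import org.apache.camel.processor.aggregate.hazelcast.HazelcastAggregationRepository;

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class OptimisticAggregationSketch extends RouteBuilder {
        @Override
        public void configure() throws Exception {
            // Cluster-wide repository; in the real setup hcInst would be the
            // application's existing HazelcastInstance rather than a new one.
            HazelcastInstance hcInst = Hazelcast.newHazelcastInstance();
            HazelcastAggregationRepository aggrRepo =
                new HazelcastAggregationRepository("aggrRepoDrsBatch", hcInst);

            from("direct:aggregate")
                .aggregate(header("JMSCorrelationID"), new UseLatestAggregationStrategy())
                    .aggregationRepository(aggrRepo)
                    // optimisticLocking() makes the AggregateProcessor use the
                    // OptimisticLockingAggregationRepository contract, so a
                    // concurrent add() from the other node surfaces as a
                    // conflict to be retried instead of silently overwriting
                    // the stored Exchange.
                    .optimisticLocking()
                    .completionSize(2)
                    .completionTimeout(30000)
                .to("direct:aggregated");
        }
    }

The DSL change is just the .optimisticLocking() flag; the open question for
2.16.5 is whether the repository's add() actually detects the concurrent
write described in the analysis above, which would need to be verified.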