Not sure why repair is running, but we are seeing the same merkle tree issue in a mixed-version cluster in which we intentionally started a repair against 2 upgraded DCs. We are currently researching and can post back if we find the cause, but we would also appreciate any suggestions. We have also run a local repair in an upgraded DC in this same mixed-version cluster without issue.

We are going from 2.1.x to 3.0.x... and yes, we know you are not supposed to run repairs in mixed-version clusters, so don't do it :) This is kind of a special circumstance where other things have gone wrong.

Thanks

On Wed, Jun 5, 2019, 5:23 PM shalom sagges <shalomsagges@gmail.com> wrote:
If anyone has any idea on what might cause this issue, it'd be great.

I don't understand what could trigger this exception.
But what I really can't understand is why repairs started to run suddenly :-\
There's no cron job running, no active repair process, no validation compactions, and Reaper is turned off... I see repair running only in the logs.

Thanks!


On Wed, Jun 5, 2019 at 2:32 PM shalom sagges <shalomsagges@gmail.com> wrote:
Hi All,

I'm in a bad situation: after upgrading 2 nodes (binaries only) from 2.1.21 to 3.11.4, I'm getting a lot of warnings like the following:

AbstractLocalAwareExecutorService.java:167 - Uncaught exception on thread Thread[ReadStage-5,5,main]: {}
java.lang.ArrayIndexOutOfBoundsException: null


I also see repair errors, but no repair is running at all. I verified this with ps -ef and nodetool compactionstats. The error I see is:
Failed creating a merkle tree for [repair #a95498f0-8783-11e9-b065-81cdbc6bee08 on system_auth/users, []], /1.2.3.4 (see log for details)

I saw repair errors on data tables as well.
nodetool status shows all nodes as UN, and nodetool describecluster shows two schema versions, as expected.
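
The id in the repair error above looks like a version-1 (time-based) UUID, so its embedded timestamp should indicate roughly when that repair session was created, which may help tell fresh sessions from stale ones. Below is a minimal Python sketch, assuming the log line format shown above. parse_repair_error is a hypothetical helper, not part of Cassandra or nodetool, and the timestamp is decoded only if the id really is a version-1 UUID.

```python
import re
import uuid
from datetime import datetime, timedelta, timezone

# Version-1 UUID timestamps count 100-ns ticks from the Gregorian epoch.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

# Assumed format, taken from the error line quoted in this thread:
#   [repair #<uuid> on <keyspace>/<table>, [<ranges>]], /<endpoint>
REPAIR_LINE = re.compile(
    r"\[repair #([0-9a-f-]{36}) on ([^/\s]+)/([^,\s]+), \[.*?\]\], (/\S+)"
)


def parse_repair_error(line):
    """Hypothetical helper: pull session id, keyspace, table and endpoint
    out of a 'Failed creating a merkle tree' log line."""
    match = REPAIR_LINE.search(line)
    if match is None:
        return None
    session = uuid.UUID(match.group(1))
    created = None
    if session.version == 1:
        # session.time is in 100-ns ticks; convert to microseconds.
        created = GREGORIAN_EPOCH + timedelta(microseconds=session.time // 10)
    return {
        "session": str(session),
        "keyspace": match.group(2),
        "table": match.group(3),
        "endpoint": match.group(4),
        "created_utc": created,
    }


if __name__ == "__main__":
    sample = (
        "Failed creating a merkle tree for [repair "
        "#a95498f0-8783-11e9-b065-81cdbc6bee08 on system_auth/users, []], "
        "/1.2.3.4 (see log for details)"
    )
    print(parse_repair_error(sample))
```

Comparing created_utc with the times the nodes were upgraded or restarted may show whether the sessions predate the upgrade, but that is only a starting point for investigation, not a diagnosis.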


After the warnings appeared, clients started getting read/write timeouts.
Restarting the 2 nodes solved the clients' connection issues, but the warnings are still being generated in the logs.

Has anyone encountered this issue, or does anyone know what it means?

Thanks!