ignite-user mailing list archives

From Pavel Tupitsyn <ptupit...@apache.org>
Subject Re: SIGSEGV instead of NullReferenceException when using Ignite.NET
Date Wed, 18 Sep 2019 11:54:42 GMT

Hi Eduard,

First of all, thank you so much for such a detailed report; this is
extremely valuable!
I've updated our troubleshooting guide:
https://apacheignite-net.readme.io/docs/troubleshooting

Yes, the JVM installs its own signal handlers:
https://docs.oracle.com/javase/9/troubleshoot/handle-signals-and-exceptions.htm
This includes SIGSEGV, which the JVM uses to implement NullPointerException
in Java, and this conflicts with a similar mechanism in .NET.
There is an -Xrs option to reduce signal usage, but unfortunately it does
not get rid of the SIGSEGV handler.

As for .NET Core 3.0 - I have it on my machine and I run some Ignite tests
with it from time to time.
So far the only issue was with IGNITE_HOME detection with NuGet:
https://issues.apache.org/jira/browse/IGNITE-10554, and it has workarounds
(copy jar files manually or with a build step).
Let me know if you encounter anything else with .NET Core 3.0, we plan to
make the next Ignite release fully compatible with it.

Thanks,
Pavel



On Wed, Sep 18, 2019 at 1:48 PM Eduard Llull <eduard@llull.net> wrote:

> Hi everyone,
>
> Almost a month ago I reported that one of our applications that uses the
> Ignite.NET thick client was getting SIGSEGVs and SIGABRTs, and that
> switching to the thin client fixed the problem but severely impacted
> performance [
> http://apache-ignite-users.70518.x6.nabble.com/NET-thin-client-multithreaded-td29116.html#a29142].
> We suspected that it was related to the fact that the embedded JVM
> installs its own signal handlers, but we had no evidence.
>
> We have been digging into this problem and today we found the cause. It
> will be a long email.
>
> The reproducer is quite simple:
> using System;
> using Apache.Ignite.Core;
>
> namespace segfault
> {
>     class Program
>     {
>         static void Main(string[] args)
>         {
>             if (args.Length == 0)
>             {
>                 Console.WriteLine("Starting Ignite");
>                 var thick = Ignition.Start();
>             }
>             else
>             {
>                 Console.WriteLine("NOT starting Ignite");
>             }
>
>             string s = null;
>             try
>             {
>                 s.ToUpper();
>             }
>             catch (NullReferenceException e)
>             {
>                 Console.WriteLine("Catched exception " + e);
>             }
>         }
>     }
> }
>
> If executed as a netcoreapp2.2 application on Linux (tested on Ubuntu
> 19.04; I haven't tested it on Windows) without passing any argument (so
> that it calls Ignition.Start()), it crashes.
>
> $ dotnet run
> Starting Ignite
> [12:17:55]    __________  ________________
> [12:17:55]   /  _/ ___/ |/ /  _/_  __/ __/
> [12:17:55]  _/ // (7 7    // /  / / / _/
> [12:17:55] /___/\___/_/|_/___/ /_/ /___/
> [12:17:55]
> [12:17:55] ver. 2.7.5#20190603-sha1:be4f2a15
> [12:17:55] 2018 Copyright(C) Apache Software Foundation
> [12:17:55]
> [12:17:55] Ignite documentation: http://ignite.apache.org
> [12:17:55]
> [12:17:55] Quiet mode.
> [12:17:55]   ^-- Logging by 'JavaLogger [quiet=true, config=null]'
> [12:17:55]   ^-- To see **FULL** console log here add -DIGNITE_QUIET=false
> or "-v" to ignite.{sh|bat}
> [12:17:55]
> [12:17:55] OS: Linux 5.0.0-25-generic amd64
> [12:17:55] VM information: OpenJDK Runtime Environment
> 1.8.0_222-8u222-b10-1ubuntu1~19.04.1-b10 Private Build OpenJDK 64-Bit
> Server VM 25.222-b10
> [12:17:55] Please set system property '-Djava.net.preferIPv4Stack=true' to
> avoid possible problems in mixed environments.
> [12:17:55] Initial heap size is 250MB (should be no less than 512MB, use
> -Xms512m -Xmx512m).
> [12:17:55] Configured plugins:
> [12:17:55]   ^-- None
> [12:17:55]
> [12:17:55] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler
> [tryStop=false, timeout=0, super=AbstractFailureHandler
> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]]]
> [12:17:55] Message queue limit is set to 0 which may lead to potential
> OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due
> to message queues growth on sender and receiver sides.
> [12:17:55] Security status [authentication=off, tls/ssl=off]
> [12:17:57] Performance suggestions for grid  (fix if possible)
> [12:17:57] To disable, set -DIGNITE_PERFORMANCE_SUGGESTIONS_DISABLED=true
> [12:17:57]   ^-- Enable G1 Garbage Collector (add '-XX:+UseG1GC' to JVM
> options)
> [12:17:57]   ^-- Specify JVM heap max size (add '-Xmx<size>[g|G|m|M|k|K]'
> to JVM options)
> [12:17:57]   ^-- Set max direct memory size if getting 'OOME: Direct
> buffer memory' (add '-XX:MaxDirectMemorySize=<size>[g|G|m|M|k|K]' to JVM
> options)
> [12:17:57]   ^-- Disable processing of calls to System.gc() (add
> '-XX:+DisableExplicitGC' to JVM options)
> [12:17:57] Refer to this page for more performance suggestions:
> https://apacheignite.readme.io/docs/jvm-and-system-tuning
> [12:17:57]
> [12:17:57] To start Console Management & Monitoring run
> ignitevisorcmd.{sh|bat}
> [12:17:57] Data Regions Configured:
> [12:17:57]   ^-- default [initSize=256,0 MiB, maxSize=3,1 GiB,
> persistence=false]
> [12:17:57]
> [12:17:57] Ignite node started OK (id=5dd14995)
> [12:17:57] Topology snapshot [ver=1, locNode=5dd14995, servers=1,
> clients=0, state=ACTIVE, CPUs=8, offheap=3.1GB, heap=3.5GB]
> *** stack smashing detected ***: <unknown> terminated
>
>
>
> If executed with any argument (so that it does not start Ignite), the
> caught NullReferenceException is printed on the console.
>
> $ dotnet run 1
> NOT starting Ignite
> Catched exception System.NullReferenceException: Object reference not
> set to an instance of an object.
>   at segfault.Program.Main(String[] args) in
> /home/eduard/Development/X-files/segfault-2/Program.cs:line 23
>
>
> So, our guess about the signal handlers looked right, and it was confirmed
> when we found these issues in the coreclr GitHub project:
>
>    1. Stack Smashing Failures (SIGSEGV) instead of
>    NullReferenceExceptions [https://github.com/dotnet/coreclr/issues/25166]
>    2. SIGSEGV is not transformed into NullReferenceException in WSL
>    [https://github.com/dotnet/coreclr/issues/25945]
>
> So, the problem arises because the .NET Core CLR uses an alternate stack
> for handling the SIGSEGV signal, but when the signal handler registered by
> the third-party native library (libjvm.so) calls the CLR signal handler, it
> does not call it on the alternate stack. The CLR signal handler cannot
> handle that case, and the program just exits.
>
> It seems to be solved in .NET Core SDK 3.0 (tested by running the
> application as netcoreapp3.0 with SDK 3.0.100-rc1-014190), but you have to
> define the environment variable COMPlus_EnableAlternateStackCheck=1
> to enable the alternate stack check [
> https://github.com/dotnet/coreclr/issues/25945#issuecomment-517199962]
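
A minimal sketch of applying that setting per command rather than for the whole shell session, using only the variable name and value quoted above. The helper name run_with_alt_stack_check is hypothetical and is not part of Ignite or the .NET SDK. It assumes the variable has to be in the environment before the dotnet process starts, which matches how the transcripts in this message pass it.

```shell
#!/bin/sh
# Hypothetical helper: runs the given command with
# COMPlus_EnableAlternateStackCheck=1 in that command's environment only.
# env(1) is used so the assignment does not leak into the calling shell.
run_with_alt_stack_check() {
    env COMPlus_EnableAlternateStackCheck=1 "$@"
}

# Usage on a machine with the .NET Core 3.0 SDK installed:
#   run_with_alt_stack_check dotnet run
```

Scoping the variable to a single command keeps other .NET processes started from the same shell unaffected, which may be preferable while the setting is still being evaluated.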
>
> Without COMPlus_EnableAlternateStackCheck, it still segfaults on
> .NET Core 3.0:
>
> $ grep netcoreapp segfault-2.csproj; dotnet run
>    <TargetFramework>netcoreapp3.0</TargetFramework>
> Starting Ignite
> [12:33:38]    __________  ________________
> [12:33:38]   /  _/ ___/ |/ /  _/_  __/ __/
> [12:33:38]  _/ // (7 7    // /  / / / _/
> [12:33:38] /___/\___/_/|_/___/ /_/ /___/
> [12:33:38]
> [12:33:38] ver. 2.7.5#20190603-sha1:be4f2a15
> [12:33:38] 2018 Copyright(C) Apache Software Foundation
> [12:33:38]
> [12:33:38] Ignite documentation: http://ignite.apache.org
> [12:33:38]
> [12:33:38] Quiet mode.
> [12:33:38]   ^-- Logging by 'JavaLogger [quiet=true, config=null]'
> [12:33:38]   ^-- To see **FULL** console log here add -DIGNITE_QUIET=false
> or "-v" to ignite.{sh|bat}
> [12:33:38]
> [12:33:38] OS: Linux 5.0.0-25-generic amd64
> [12:33:38] VM information: OpenJDK Runtime Environment
> 1.8.0_222-8u222-b10-1ubuntu1~19.04.1-b10 Private Build OpenJDK 64-Bit
> Server VM 25.222-b10
> [12:33:39] Please set system property '-Djava.net.preferIPv4Stack=true' to
> avoid possible problems in mixed environments.
> [12:33:39] Initial heap size is 250MB (should be no less than 512MB, use
> -Xms512m -Xmx512m).
> [12:33:39] Configured plugins:
> [12:33:39]   ^-- None
> [12:33:39]
> [12:33:39] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler
> [tryStop=false, timeout=0, super=AbstractFailureHandler
> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]]]
> [12:33:39] Message queue limit is set to 0 which may lead to potential
> OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due
> to message queues growth on sender and receiver sides.
> [12:33:39] Security status [authentication=off, tls/ssl=off]
> [12:33:40] Performance suggestions for grid  (fix if possible)
> [12:33:40] To disable, set -DIGNITE_PERFORMANCE_SUGGESTIONS_DISABLED=true
> [12:33:40]   ^-- Enable G1 Garbage Collector (add '-XX:+UseG1GC' to JVM
> options)
> [12:33:40]   ^-- Specify JVM heap max size (add '-Xmx<size>[g|G|m|M|k|K]'
> to JVM options)
> [12:33:40]   ^-- Set max direct memory size if getting 'OOME: Direct
> buffer memory' (add '-XX:MaxDirectMemorySize=<size>[g|G|m|M|k|K]' to JVM
> options)
> [12:33:40]   ^-- Disable processing of calls to System.gc() (add
> '-XX:+DisableExplicitGC' to JVM options)
> [12:33:40] Refer to this page for more performance suggestions:
> https://apacheignite.readme.io/docs/jvm-and-system-tuning
> [12:33:40]
> [12:33:40] To start Console Management & Monitoring run
> ignitevisorcmd.{sh|bat}
> [12:33:40] Data Regions Configured:
> [12:33:40]   ^-- default [initSize=256,0 MiB, maxSize=3,1 GiB,
> persistence=false]
> [12:33:40]
> [12:33:40] Ignite node started OK (id=711e0976)
> [12:33:40] Topology snapshot [ver=1, locNode=711e0976, servers=1,
> clients=0, state=ACTIVE, CPUs=8, offheap=3.1GB, heap=3.5GB]
> *** stack smashing detected ***: <unknown> terminated
>
>
> With COMPlus_EnableAlternateStackCheck=1 set, the exception is caught:
>
> $ grep netcoreapp segfault-2.csproj; COMPlus_EnableAlternateStackCheck=1 dotnet run
>    <TargetFramework>netcoreapp3.0</TargetFramework>
> Starting Ignite
> [12:35:20]    __________  ________________
> [12:35:20]   /  _/ ___/ |/ /  _/_  __/ __/
> [12:35:20]  _/ // (7 7    // /  / / / _/
> [12:35:20] /___/\___/_/|_/___/ /_/ /___/
> [12:35:20]
> [12:35:20] ver. 2.7.5#20190603-sha1:be4f2a15
> [12:35:20] 2018 Copyright(C) Apache Software Foundation
> [12:35:20]
> [12:35:20] Ignite documentation: http://ignite.apache.org
> [12:35:20]
> [12:35:20] Quiet mode.
> [12:35:20]   ^-- Logging by 'JavaLogger [quiet=true, config=null]'
> [12:35:20]   ^-- To see **FULL** console log here add -DIGNITE_QUIET=false
> or "-v" to ignite.{sh|bat}
> [12:35:20]
> [12:35:20] OS: Linux 5.0.0-25-generic amd64
> [12:35:20] VM information: OpenJDK Runtime Environment
> 1.8.0_222-8u222-b10-1ubuntu1~19.04.1-b10 Private Build OpenJDK 64-Bit
> Server VM 25.222-b10
> [12:35:20] Please set system property '-Djava.net.preferIPv4Stack=true' to
> avoid possible problems in mixed environments.
> [12:35:20] Initial heap size is 250MB (should be no less than 512MB, use
> -Xms512m -Xmx512m).
> [12:35:21] Configured plugins:
> [12:35:21]   ^-- None
> [12:35:21]
> [12:35:21] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler
> [tryStop=false, timeout=0, super=AbstractFailureHandler
> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]]]
> [12:35:21] Message queue limit is set to 0 which may lead to potential
> OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due
> to message queues growth on sender and receiver sides.
> [12:35:21] Security status [authentication=off, tls/ssl=off]
> [12:35:22] Performance suggestions for grid  (fix if possible)
> [12:35:22] To disable, set -DIGNITE_PERFORMANCE_SUGGESTIONS_DISABLED=true
> [12:35:22]   ^-- Enable G1 Garbage Collector (add '-XX:+UseG1GC' to JVM
> options)
> [12:35:22]   ^-- Specify JVM heap max size (add '-Xmx<size>[g|G|m|M|k|K]'
> to JVM options)
> [12:35:22]   ^-- Set max direct memory size if getting 'OOME: Direct
> buffer memory' (add '-XX:MaxDirectMemorySize=<size>[g|G|m|M|k|K]' to JVM
> options)
> [12:35:22]   ^-- Disable processing of calls to System.gc() (add
> '-XX:+DisableExplicitGC' to JVM options)
> [12:35:22] Refer to this page for more performance suggestions:
> https://apacheignite.readme.io/docs/jvm-and-system-tuning
> [12:35:22]
> [12:35:22] To start Console Management & Monitoring run
> ignitevisorcmd.{sh|bat}
> [12:35:22] Data Regions Configured:
> [12:35:22]   ^-- default [initSize=256,0 MiB, maxSize=3,1 GiB,
> persistence=false]
> [12:35:22]
> [12:35:22] Ignite node started OK (id=841d9bca)
> [12:35:22] Topology snapshot [ver=1, locNode=841d9bca, servers=1,
> clients=0, state=ACTIVE, CPUs=8, offheap=3.1GB, heap=3.5GB]
> Catched exception System.NullReferenceException: Object reference not
> set to an instance of an object.
>   at segfault.Program.Main(String[] args) in
> /home/eduard/Development/X-files/segfault-2/Program.cs:line 23
>
>
> Our plan is to change our application to use .NET Core 3.0 and the
> thick client. We know that it is currently in RC, but it is expected to be
> released on the 23rd of September. As we will be performing in-depth tests
> to see whether anything breaks, we expect that 3.0 will have been
> released by the time we decide to deploy it in production.
>
> So, first of all, I wanted to let you know about this issue in case
> anybody runs into the same situation.
>
> And finally, do you guys foresee any problem with the migration?
>
>
