ignite-user mailing list archives

From Andrey Mashenkov <andrey.mashen...@gmail.com>
Subject Re: Segmentation fault (JVM crash) while memory restoring on start with native persistence
Date Mon, 15 Jan 2018 14:50:19 GMT
Hi Arseny,

Have you had any success reproducing the issue and getting a stack trace?
Do you observe the same behavior on Oracle JDK?

On Tue, Dec 26, 2017 at 2:43 PM, Andrey Mashenkov <
andrey.mashenkov@gmail.com> wrote:

> Hi Arseny,
>
> This looks like a known issue that is still unresolved [1],
> but we can't be sure it is the same issue, as there is no stack trace in the
> attached logs.
>
>
> [1] https://issues.apache.org/jira/browse/IGNITE-7278
>
> On Tue, Dec 26, 2017 at 12:54 PM, Arseny Kovalchuk <
> arseny.kovalchuk@synesis.ru> wrote:
>
>> Hi guys.
>>
>> We've successfully tested Ignite as an in-memory solution, and it showed
>> acceptable performance. But we cannot get the Ignite cluster to work stably
>> with native persistence enabled. The first error we got is a segmentation
>> fault (JVM crash) while memory is being restored on start.
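>>
>> For reference, native persistence in 2.3 is enabled through a persistent
>> data region. The snippet below is only a minimal programmatic sketch of
>> that setup (our real settings are in the attached XML config; the instance
>> name and work directory are taken from the logs just for illustration):
>>
>> import org.apache.ignite.Ignition;
>> import org.apache.ignite.configuration.DataStorageConfiguration;
>> import org.apache.ignite.configuration.IgniteConfiguration;
>>
>> public class PersistentNodeStartup {
>>     public static void main(String[] args) {
>>         // Enable native persistence for the default data region.
>>         DataStorageConfiguration storageCfg = new DataStorageConfiguration();
>>         storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
>>
>>         IgniteConfiguration cfg = new IgniteConfiguration()
>>             .setIgniteInstanceName("ignite-instance-0")   // illustrative
>>             .setWorkDirectory("/ignite-work-directory")   // illustrative
>>             .setDataStorageConfiguration(storageCfg);
>>
>>         // With persistence enabled, the started cluster stays inactive
>>         // until it is activated explicitly.
>>         Ignition.start(cfg);
>>     }
>> }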
>>
>> [2017-12-22 11:11:51,992]  INFO [exchange-worker-#46%ignite-instance-0%] org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager: - Read checkpoint status [startMarker=/ignite-work-directory/db/ignite_instance_0/cp/1513938154201-8c574131-763d-4cfa-99b6-0ce0321d61ab-START.bin, endMarker=/ignite-work-directory/db/ignite_instance_0/cp/1513932413840-55ea1713-8e9e-44cd-b51a-fcad8fb94de1-END.bin]
>> [2017-12-22 11:11:51,993]  INFO [exchange-worker-#46%ignite-instance-0%] org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager: - Checking memory state [lastValidPos=FileWALPointer [idx=391, fileOffset=220593830, len=19573, forceFlush=false], lastMarked=FileWALPointer [idx=394, fileOffset=38532201, len=19573, forceFlush=false], lastCheckpointId=8c574131-763d-4cfa-99b6-0ce0321d61ab]
>> [2017-12-22 11:11:51,993]  WARN [exchange-worker-#46%ignite-instance-0%] org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager: - Ignite node stopped in the middle of checkpoint. Will restore memory state and finish checkpoint on node start.
>> [CodeBlob (0x00007f9b58f24110)]
>> Framesize: 0
>> BufferBlob (0x00007f9b58f24110) used for StubRoutines (2)
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> #  Internal Error (sharedRuntime.cpp:842), pid=221, tid=0x00007f9b473c1ae8
>> #  fatal error: exception happened outside interpreter, nmethods and vtable stubs at pc 0x00007f9b58f248f6
>> #
>> # JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
>> # Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
>> # Derivative: IcedTea 3.6.0
>> # Distribution: Custom build (Tue Nov 21 11:22:36 GMT 2017)
>> # Core dump written. Default location: /opt/ignite/core or core.221
>> #
>> # An error report file with more information is saved as:
>> # /ignite-work-directory/core_dump_221.log
>> #
>> # If you would like to submit a bug report, please include
>> # instructions on how to reproduce the bug and visit:
>> #   http://icedtea.classpath.org/bugzilla
>> #
>>
>>
>>
>> Please find logs and configs attached.
>>
>> We deploy Ignite along with our services in Kubernetes (v1.8) on premises.
>> The Ignite cluster is a StatefulSet of 5 Pods (5 instances) of Ignite
>> version 2.3. Each Pod mounts a PersistentVolume backed by Ceph RBD.
>>
>> We put about 230 events/second into Ignite; 70% of the events are ~200KB in
>> size and 30% are ~5000KB. The smaller events have indexed fields, and we
>> query them via SQL.
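>>
>> For clarity, here is a rough sketch of how the smaller events are indexed
>> and queried via SQL; the class, field and cache names are hypothetical,
>> not our actual model:
>>
>> import org.apache.ignite.Ignite;
>> import org.apache.ignite.IgniteCache;
>> import org.apache.ignite.cache.query.SqlFieldsQuery;
>> import org.apache.ignite.cache.query.annotations.QuerySqlField;
>> import org.apache.ignite.configuration.CacheConfiguration;
>>
>> public class SmallEventCache {
>>     /** Hypothetical small-event model; only eventTime is indexed. */
>>     public static class SmallEvent {
>>         @QuerySqlField(index = true)
>>         long eventTime;
>>
>>         @QuerySqlField
>>         String payload;
>>     }
>>
>>     /** Creates a cache whose SQL schema is derived from the annotations. */
>>     public static IgniteCache<Long, SmallEvent> create(Ignite ignite) {
>>         CacheConfiguration<Long, SmallEvent> cfg =
>>             new CacheConfiguration<Long, SmallEvent>("small-events")
>>                 .setIndexedTypes(Long.class, SmallEvent.class);
>>
>>         return ignite.getOrCreateCache(cfg);
>>     }
>>
>>     /** Example SQL query over the indexed field. */
>>     public static long countSince(IgniteCache<Long, SmallEvent> cache, long since) {
>>         return (Long) cache.query(new SqlFieldsQuery(
>>             "select count(*) from SmallEvent where eventTime > ?").setArgs(since))
>>             .getAll().get(0).get(0);
>>     }
>> }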
>>
>> The cluster is activated from a client node, which also streams events
>> into Ignite from Kafka. We use a custom streamer implementation that uses
>> the cache.putAll() API.
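>>
>> A minimal sketch of that write path is shown below (the cache name,
>> key/value types and batch size are assumptions for illustration, not our
>> production code):
>>
>> import java.util.HashMap;
>> import java.util.Map;
>>
>> import org.apache.ignite.Ignite;
>> import org.apache.ignite.IgniteCache;
>>
>> public class PutAllStreamer {
>>     /** Writes one batch of events pulled from Kafka into an Ignite cache. */
>>     public static void writeBatch(Ignite ignite, Map<Long, byte[]> events) {
>>         // A persistent cluster starts inactive and must be activated once
>>         // before caches can be used (Ignite 2.3 API).
>>         if (!ignite.active())
>>             ignite.active(true);
>>
>>         IgniteCache<Long, byte[]> cache = ignite.cache("events");
>>
>>         Map<Long, byte[]> batch = new HashMap<>();
>>
>>         for (Map.Entry<Long, byte[]> e : events.entrySet()) {
>>             batch.put(e.getKey(), e.getValue());
>>
>>             if (batch.size() >= 500) {   // batch size is illustrative
>>                 cache.putAll(batch);
>>                 batch.clear();
>>             }
>>         }
>>
>>         if (!batch.isEmpty())
>>             cache.putAll(batch);
>>     }
>> }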
>>
>> We got the error when we stopped and restarted the cluster. It happened on
>> only one instance.
>>
>> The general question is:
>>
>> *Is it possible to tune (or implement) native persistence in such a way
>> that it just reports an error about corrupted data, skips it, and continues
>> to work without that corrupted part, so that the cluster keeps operating
>> regardless of errors on storage?*
>>
>>
>> Arseny Kovalchuk
>>
>> Senior Software Engineer at Synesis
>> skype: arseny.kovalchuk
>> mobile: +375 (29) 666-16-16
>> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>>
>
>
>
> --
> Best regards,
> Andrey V. Mashenkov
>



-- 
Best regards,
Andrey V. Mashenkov
