flink-issues mailing list archives

From "Kevin Pullin (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (FLINK-8309) JVM sigsegv crash when enabling async checkpoints
Date Fri, 22 Dec 2017 19:30:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-8309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Pullin closed FLINK-8309.
-------------------------------
    Resolution: Resolved

> JVM sigsegv crash when enabling async checkpoints
> -------------------------------------------------
>
>                 Key: FLINK-8309
>                 URL: https://issues.apache.org/jira/browse/FLINK-8309
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.4.0, 1.3.2
>         Environment: macOS 10.13.2 & Ubuntu 16.04.03 using JVM 1.8.0_151.
>            Reporter: Kevin Pullin
>         Attachments: StreamingJob.scala
>
>
> h4. Summary
> I have a streaming job with async checkpointing enabled. The job is crashing the JVM with a SIGSEGV error coinciding with checkpoint completion.
> Workarounds are noted below. I thought this was worth documenting in case someone runs into similar issues or if a fix is possible.
> h4. Job Overview & Observations
> The job itself stores a large quantity of `case class` objects in `ValueState`s contained within a `RichFilterFunction`. This data is used for deduplicating events.
> Any one of the following changes stops the crash:
>  - moving the case class outside of the anonymous RichFilterFunction class.
>  - reducing the number of objects stored in the ValueState.
>  - reducing the size of the objects stored in the ValueState.
>  - disabling async snapshots.
> I can provide additional crash data as needed (core dumps, error logs, etc.). The StateBackend implementation doesn't matter; the job fails using the Memory, Fs, and RocksDB backends.
> From what I understand, anonymous classes should be avoided with checkpointing because their generated names aren't stable, so moving the case class out of the anonymous class seems like the best route for me.
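The naming point can be illustrated without Flink. The following is a minimal sketch, assuming a hypothetical `Filter` trait as a stand-in for `RichFilterFunction`; `TopLevelEvent`, `NestedEvent`, and `NamingDemo` are illustrative names and are not taken from the attached job. It only shows how the JVM class name of a case class depends on where it is declared. It does not reproduce the crash.

```scala
// Hypothetical stand-in for a filter function; this sketch does not use Flink.
trait Filter {
  def payloadClassName: String
}

// Declared at the top level: the JVM class name comes from the declaration itself.
case class TopLevelEvent(id: String)

object NamingDemo {
  // Case class declared inside an anonymous class, similar in shape to the reported job.
  // Its JVM name includes the compiler-generated name of the enclosing anonymous class.
  val anonymous: Filter = new Filter {
    case class NestedEvent(id: String)
    def payloadClassName: String = NestedEvent("a").getClass.getName
  }

  // Same anonymous filter shape, but the payload type lives at the top level.
  val named: Filter = new Filter {
    def payloadClassName: String = TopLevelEvent("a").getClass.getName
  }

  def main(args: Array[String]): Unit = {
    println(anonymous.payloadClassName) // includes a compiler-generated "anon" segment
    println(named.payloadClassName)     // derived only from the declared class name
  }
}
```

Because the compiler numbers anonymous classes in declaration order, adding or reordering anonymous classes in the same enclosing scope can change the first name, while the second stays fixed.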
> h4. Reproduction case
> The attached `StreamingJob.scala` file contains a minimal repro case that closely matches my actual job configuration. Running it consistently crashes the JVM upon completion of the first checkpoint.
> My test runs set only two JVM options: -Xms4g -Xmx4g
> h4. Crash output
> Here's a crash captured from Ubuntu:
> {noformat}
> [info] #
> [info] # A fatal error has been detected by the Java Runtime Environment:
> [info] #
> [info] #  SIGSEGV (0xb) at pc=0x00007fd192b92c1c, pid=7191, tid=0x00007fd0873f3700
> [info] #
> [info] # JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
> [info] # Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
> [info] # Problematic frame:
> [info] # C  [libzip.so+0x5c1c]
> [info] #
> [info] # Core dump written. Default location: /home/ubuntu/flink-project/core or core.7191
> [info] #
> [info] # An error report file with more information is saved as:
> [info] # /home/XXX/flink-project/hs_err_pid7191.log
> [info] Compiled method (nm)   71547   81     n 0       java.util.zip.ZipFile::getEntry (native)
> [info]  total in heap  [0x00007fd17d12e290,0x00007fd17d12e600] = 880
> [info]  relocation     [0x00007fd17d12e3b8,0x00007fd17d12e400] = 72
> [info]  main code      [0x00007fd17d12e400,0x00007fd17d12e600] = 512
> [info] #
> [info] # If you would like to submit a bug report, please visit:
> [info] #   http://bugreport.java.com/bugreport/crash.jsp
> [info] #
> {noformat}
> And one from macOS:
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x0000000105264c48, pid=30848, tid=0x0000000000003403
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode bsd-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.dylib+0x464c48]
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /Users/XXX/src/etc_flink_mwx/hs_err_pid30848.log
> [thread 30211 also had an error]
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> # The crash happened outside the Java Virtual Machine in native code.
> # See problematic frame for where to report the bug.
> #
> Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
