nifi-dev mailing list archives

From Andy Christianson <achristian...@hortonworks.com>
Subject Re: [DISCUSS] Increasing durability in MiNiFi C++
Date Tue, 01 Aug 2017 14:32:42 GMT
In addition to the tickets mentioned, we probably want to isolate custom processors as
much as possible. I.e. if a custom processor segfaults, we probably don’t want that to bring
down the entire MiNiFi process. Achieving that type of isolation might come with some tradeoffs,
though. For instance, we may need to implement process-level isolation, similar to how the
Chromium browser isolates tab processes, but doing so would come with additional memory and
IPC overhead. Maybe there are some modern sandboxing techniques we can look at.

Something to consider.
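
A minimal sketch of what process-level isolation could look like on a POSIX system, assuming each unit of custom-processor work can be run in a forked child. The names IsolatedResult and run_isolated are hypothetical and are not part of MiNiFi; a real design would also need IPC to move flow file data between processes, which is where the overhead mentioned above comes from.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cerrno>
#include <csignal>
#include <functional>

// Outcome of running one unit of work in a child process (hypothetical type).
struct IsolatedResult {
  bool crashed;     // true if the child was terminated by a signal
  int exit_code;    // child's exit code, or -1 if it could not be obtained
  int term_signal;  // terminating signal when crashed is true, otherwise 0
};

// Hypothetical helper: runs `work` in a forked child so that a crash in
// `work` (for example a segfault) terminates only the child process.
IsolatedResult run_isolated(const std::function<int()>& work) {
  pid_t pid = fork();
  if (pid < 0) {
    return {false, -1, 0};  // fork itself failed; nothing was run
  }
  if (pid == 0) {
    // Child: run the work and exit without running the parent's atexit handlers.
    _exit(work());
  }
  // Parent: wait for the child, retrying if interrupted by a signal.
  int status = 0;
  while (waitpid(pid, &status, 0) < 0) {
    if (errno != EINTR) {
      return {false, -1, 0};
    }
  }
  if (WIFSIGNALED(status)) {
    return {true, -1, WTERMSIG(status)};
  }
  return {false, WIFEXITED(status) ? WEXITSTATUS(status) : -1, 0};
}
```

With this shape, a caller could check `crashed` after each invocation and decide whether to restart, skip, or disable the offending processor while the parent keeps running.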

On 8/1/17, 9:59 AM, "Marc" <phrocker@apache.org> wrote:

    Good Morning,
    
      I've begun capturing some details in a ticket for durability and
    reliability of MiNiFi C++ clients [1]. The scope of this ticket is
    continuing operations despite failure within specific components. There is
    a linked ticket [2] that attempts to address some of the concerns brought up in
    MINIFI-356, focusing on memory usage.
    
      The spirit of the ticket was meant to capture conditions of known
    failure; however, given that more discussion has blossomed, I'd like to
    assess the experience of the mailing list. Continuing operations in any
    environment is difficult, particularly one in which we likely have little
    to no control. Simply gathering information to know when a failure is
    occurring is a major part of the battle. According to the tickets, there
    needs to be some discussion of how we classify failure.
    
      The ticket addressed the low hanging fruit, but there are certainly more
    conditions of failure. If a disk switches to read-only mode, becomes full,
    and/or runs out of inode entries, etc., we know a complete failure occurred
    and thus can switch our type of write activity to use a volatile repo. I
    recognize that partial failures may occur, but how do we classify these?
    Should we classify these at all or would this be venturing into a rabbit
    hole?
    
       For memory we can likely throttle queue sizes as needed. For networking
    and other components we could likely find other measures of failure. The
    goal, no matter the component, is to continue operations without human
    intervention -- with the hope that the configuration makes the bounds of
    the client obvious.
    
       My gut reaction is to treat partial failure separately, as the low hanging
    fruit of complete failure is much easier to address, but I would love to hear
    the reaction of this list. Further, any input on the types of failures to
    address would be appreciated. I look forward to any and all responses.
    
      Best Regards,
      Marc
    
    [1] https://issues.apache.org/jira/browse/MINIFI-356
    [2] https://issues.apache.org/jira/browse/MINIFI-360
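
A minimal sketch of how the complete-failure disk conditions listed in the quoted message (a disk that stops accepting writes, fills up, or runs out of inodes) might be detected, assuming a POSIX system and statvfs(). The names DiskState, classify, probe, and is_complete_failure are hypothetical and are not taken from MiNiFi or from the tickets; the thresholds (exactly zero free blocks or inodes) are also an assumption.

```cpp
#include <sys/statvfs.h>

#include <string>

// Hypothetical classification of the state of the disk behind a repository.
enum class DiskState { Healthy, ReadOnly, OutOfSpace, OutOfInodes, Unknown };

// Pure classification step, kept separate from the system call so that it
// can be exercised without a disk that is actually failing.
DiskState classify(const struct statvfs& s) {
  if (s.f_flag & ST_RDONLY) {
    return DiskState::ReadOnly;  // filesystem is mounted read-only
  }
  if (s.f_bavail == 0) {
    return DiskState::OutOfSpace;  // no blocks left for unprivileged writers
  }
  // Some filesystems report f_files == 0 because they have no fixed inode
  // table, so inode exhaustion is only flagged when a limit is reported.
  if (s.f_files != 0 && s.f_favail == 0) {
    return DiskState::OutOfInodes;
  }
  return DiskState::Healthy;
}

// Queries the filesystem that contains `path`.
DiskState probe(const std::string& path) {
  struct statvfs s {};
  if (statvfs(path.c_str(), &s) != 0) {
    return DiskState::Unknown;  // could not query; left to the caller to decide
  }
  return classify(s);
}

// True for the conditions treated above as complete failure, i.e. the cases
// where falling back to a volatile repository would be considered.
bool is_complete_failure(DiskState st) {
  return st == DiskState::ReadOnly || st == DiskState::OutOfSpace ||
         st == DiskState::OutOfInodes;
}
```

Partial failures (for example intermittent write errors) are deliberately not modeled here, since how to classify them is the open question in the message.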
    
