arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Philipp Moritz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ARROW-263) Design an initial IPC mechanism for Arrow Vectors
Date Fri, 19 Aug 2016 05:02:20 GMT

    [ https://issues.apache.org/jira/browse/ARROW-263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427632#comment-15427632
] 

Philipp Moritz commented on ARROW-263:
--------------------------------------

Hey Micah,

thanks for your insights, seems we have been thinking along similar lines. I wrote one shared
memory object store based on boost::IPC, which under the hood uses shm_open, one based on
Chrome's Mojo IPC system (which is great but pulls in a lot of dependencies) and currently
I am writing a new very lightweight one based on POSIX. I'm happy to share my findings:

Answering your in-depth analysis questions:
1. This is essentially the design of boost::IPC and this design makes it hard to ensure that
named shared objects are removed properly (they cannot be removed automatically by the OS).
If the process that is supposed to clean them up crashes, you will need to delete them manually.

2. Using mmaped based APIs is the way to go in my opionion. To create the file, you create
a temporary file of size 0, keep the file descriptor around, unlink the file from the file
system, and manipulate the file by mmaping it from the file descriptor. If you want to share
the file with another process, you pass the file descriptor over a Unix Domain Socket. On
Windows there also seem to be ways of doing this.

3. I also found the guarantees given by the JVM too restrictive. One way to do it which is
working for me is to use JNI and to wrap some native C code. The alternative is to use JVM
memory mapped files anyways and to hope they do the right thing (which they seem to do on
linux), and keep the file underlying the memory around (with similar problems as 1).

You find my implementation here: https://github.com/pcmoritz/plasma, it essentially implements
what you describe in *. I'd love to get your feedback, and if we can share some code that
would be even better. I also think that Mojo (Google Chrome's new IPC layer) got this "open
file of size 0, unlinke the file, keep the file descriptor around" approach working on Windows,
so this should be ok.

The central question is then if people are ok with using JNI for java. I have some code to
do this which I'm happy to share with you in the next couple of days.

All the best,
Philipp.

> Design an initial IPC mechanism for Arrow Vectors
> -------------------------------------------------
>
>                 Key: ARROW-263
>                 URL: https://issues.apache.org/jira/browse/ARROW-263
>             Project: Apache Arrow
>          Issue Type: New Feature
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>
> Prior discussion on this topic [1].
> Use-cases:
> 1.  User defined function (UDF) execution:  One process wants to execute a user defined
function written in another language (e.g. Java executing a function defined in python, this
involves creating Arrow Arrays in java, sending them to python and receiving a new set of
Arrow Arrays produced in python back in the java process).
> 2.  If a storage system and a query engine are running on the same host we might want
use IPC instead of RPC (e.g. Apache Drill querying Apache Kudu)
> Assumptions:
> 1.  IPC mechanism should be useable from the core set of supported languages (Java, Python,
C) on POSIX and ideally windows systems.  Ideally, we would not need to add dependencies on
additional libraries outside of each languages outside of this document.
> We want leverage shared memory for Arrays to avoid doubling RAM requirements by duplicating
the same Array in different memory locations.  
> 2. Under some circumstances shared memory might be more efficient than FIFOs or sockets
(in other scenarios they won’t see thread below).
> 3. Security is not a concern for V1, we assume all processes running are “trusted”.
> Requirements:
> 1.Resource management: 
>     a.  Both processes need a way of allocating memory for Arrow Arrays so that data
can be passed from one process to another.
>     b. There must be a mechanism to cleanup unused Arrow Arrays to limit resource usage
but avoid race conditions when processing arrays
> 2.  Schema negotiation - before sending data, both processes need to agree on schema
each one will produce.
> Out of scope requirements:
> 1.  IPC channel metadata discovery is out of scope of this document.  Discovery can be
provided by passing appropriate command line arguments, configuration files or other mechanisms
like RPC (in which case RPC channel discovery is still an issue).
> [1] http://mail-archives.apache.org/mod_mbox/arrow-dev/201603.mbox/%3C8D5F7E3237B3ED47B84CF187BB17B666148E7322@SHSMSX103.ccr.corp.intel.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message