arrow-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From w...@apache.org
Subject arrow git commit: ARROW-1173: [Plasma] Add blog post describing Plasma object store
Date Tue, 08 Aug 2017 02:50:43 GMT
Repository: arrow
Updated Branches:
  refs/heads/master 66ab6b261 -> 03dcce446


ARROW-1173: [Plasma] Add blog post describing Plasma object store

Author: Robert Nishihara <robertnishihara@gmail.com>

Closes #940 from robertnishihara/plasmablogpost and squashes the following commits:

d7230930 [Robert Nishihara] Update blog post date.
48c9c7b9 [Robert Nishihara] Change speedup after improving baseline.
2ae1d66e [Robert Nishihara] Add blog post describing Plasma object store.


Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/03dcce44
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/03dcce44
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/03dcce44

Branch: refs/heads/master
Commit: 03dcce44671f355b3d259b913fcabace609a9cd2
Parents: 66ab6b2
Author: Robert Nishihara <robertnishihara@gmail.com>
Authored: Mon Aug 7 22:50:38 2017 -0400
Committer: Wes McKinney <wes.mckinney@twosigma.com>
Committed: Mon Aug 7 22:50:38 2017 -0400

----------------------------------------------------------------------
 .../2017-08-08-plasma-in-memory-object-store.md | 150 +++++++++++++++++++
 1 file changed, 150 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow/blob/03dcce44/site/_posts/2017-08-08-plasma-in-memory-object-store.md
----------------------------------------------------------------------
diff --git a/site/_posts/2017-08-08-plasma-in-memory-object-store.md b/site/_posts/2017-08-08-plasma-in-memory-object-store.md
new file mode 100644
index 0000000..48cfb66
--- /dev/null
+++ b/site/_posts/2017-08-08-plasma-in-memory-object-store.md
@@ -0,0 +1,150 @@
+---
+layout: post
+title: "Plasma In-Memory Object Store"
+date: "2017-08-08 00:00:00 -0400"
+author: Philipp Moritz and Robert Nishihara
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+*[Philipp Moritz][1] and [Robert Nishihara][2] are graduate students at UC
+ Berkeley.*
+
+## Plasma: A High-Performance Shared-Memory Object Store
+
+### Motivating Plasma
+
+This blog post presents Plasma, an in-memory object store that is being
+developed as part of Apache Arrow. **Plasma holds immutable objects in shared
+memory so that they can be accessed efficiently by many clients across process
+boundaries.** In light of the trend toward larger and larger multicore machines,
+Plasma enables critical performance optimizations in the big data regime.
+
+Plasma was initially developed as part of [Ray][3], and has recently been moved
+to Apache Arrow in the hopes that it will be broadly useful.
+
+One of the goals of Apache Arrow is to serve as a common data layer enabling
+zero-copy data exchange between multiple frameworks. A key component of this
+vision is the use of off-heap memory management (via Plasma) for storing and
+sharing Arrow-serialized objects between applications.
+
+**Expensive serialization and deserialization as well as data copying are a
+common performance bottleneck in distributed computing.** For example, a
+Python-based execution framework that wishes to distribute computation across
+multiple Python “worker” processes and then aggregate the results in a single
+“driver” process may choose to serialize data using the built-in `pickle`
+library. Assuming one Python process per core, each worker process would have to
+copy and deserialize the data, resulting in excessive memory usage. The driver
+process would then have to deserialize results from each of the workers,
+resulting in a bottleneck.
+
+Using Plasma plus Arrow, the data being operated on would be placed in the
+Plasma store once, and all of the workers would read the data without copying or
+deserializing it (the workers would map the relevant region of memory into their
+own address spaces). The workers would then put the results of their computation
+back into the Plasma store, which the driver could then read and aggregate
+without copying or deserializing the data.
+
+### The Plasma API:
+
+Below we illustrate a subset of the API. The C++ API is documented more fully
+[here][6], and the Python API is documented [here][7].
+
+**Object IDs:** Each object is associated with a string of bytes.
+
+**Creating an object:** Objects are stored in Plasma in two stages. First, the
+object store *creates* the object by allocating a buffer for it. At this point,
+the client can write to the buffer and construct the object within the allocated
+buffer. When the client is done, the client *seals* the buffer making the object
+immutable and making it available to other Plasma clients.
+
+```python
+# Create an object.
+object_id = pyarrow.plasma.ObjectID(20 * b'a')
+object_size = 1000
+buffer = memoryview(client.create(object_id, object_size))
+
+# Write to the buffer.
+for i in range(1000):
+    buffer[i] = 0
+
+# Seal the object making it immutable and available to other clients.
+client.seal(object_id)
+```
+
+**Getting an object:** After an object has been sealed, any client who knows the
+object ID can get the object.
+
+```python
+# Get the object from the store. This blocks until the object has been sealed.
+object_id = pyarrow.plasma.ObjectID(20 * b'a')
+[buff] = client.get([object_id])
+buffer = memoryview(buff)
+```
+
+If the object has not been sealed yet, then the call to `client.get` will block
+until the object has been sealed.
+
+### A sorting application
+
+To illustrate the benefits of Plasma, we demonstrate an **11x speedup** (on a
+machine with 20 physical cores) for sorting a large pandas DataFrame (one
+billion entries). The baseline is the built-in pandas sort function, which sorts
+the DataFrame in 477 seconds. To leverage multiple cores, we implement the
+following standard distributed sorting scheme.
+
+* We assume that the data is partitioned across K pandas DataFrames and that
+  each one already lives in the Plasma store.
+* We subsample the data, sort the subsampled data, and use the result to define
+  L non-overlapping buckets.
+* For each of the K data partitions and each of the L buckets, we find the
+  subset of the data partition that falls in the bucket, and we sort that
+  subset.
+* For each of the L buckets, we gather all of the K sorted subsets that fall in
+  that bucket.
+* For each of the L buckets, we merge the corresponding K sorted subsets.
+* We turn each bucket into a pandas DataFrame and place it in the Plasma store.
+
+Using this scheme, we can sort the DataFrame (the data starts and ends in the
+Plasma store), in 44 seconds, giving an 11x speedup over the baseline.
+
+### Design
+
+The Plasma store runs as a separate process. It is written in C++ and is
+designed as a single-threaded event loop based on the [Redis][4] event loop library.
+The plasma client library can be linked into applications. Clients communicate
+with the Plasma store via messages serialized using [Google Flatbuffers][5].
+
+### Call for contributions
+
+Plasma is a work in progress, and the API is currently unstable. Today Plasma is
+primarily used in [Ray][3] as an in-memory cache for Arrow serialized objects.
+We are looking for a broader set of use cases to help refine Plasma’s API. In
+addition, we are looking for contributions in a variety of areas including
+improving performance and building other language bindings. Please let us know
+if you are interested in getting involved with the project.
+
+[1]: https://people.eecs.berkeley.edu/~pcmoritz/
+[2]: http://www.robertnishihara.com
+[3]: https://github.com/ray-project/ray
+[4]: https://redis.io/
+[5]: https://google.github.io/flatbuffers/
+[6]: https://github.com/apache/arrow/blob/master/cpp/apidoc/tutorials/plasma.md
+[7]: https://github.com/apache/arrow/blob/master/python/doc/source/plasma.rst


Mime
View raw message