arrow-commits mailing list archives

From pcmor...@apache.org
Subject arrow git commit: ARROW-1381: [Python] Use FixedSizeBufferWriter in SerializedPyObject.to_buffer
Date Wed, 30 Aug 2017 21:58:35 GMT
Repository: arrow
Updated Branches:
  refs/heads/master d8c651ce2 -> f45002552


ARROW-1381: [Python] Use FixedSizeBufferWriter in SerializedPyObject.to_buffer

With this setup:

```
import numpy as np
import pyarrow as pa

objects = [np.random.randn(500, 500) for i in range(400)]
serialized = pa.serialize(objects)
```

Before this patch:

```
In [3]: %timeit buf = serialized.to_buffer()
201 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

and after:

```
In [4]: %timeit buf = serialized.to_buffer()
81.1 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

I added an `nthreads` option, but note that when the objects are small, multithreading makes
things slower due to the overhead of launching threads. The 1MB threshold in `arrow/io/memory.cc`
may be too small; we should do some benchmarking to find a better default crossover point for
switching between parallel and serial memcpy:

```
In [2]: %timeit buf = serialized.to_buffer(nthreads=4)
134 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
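The crossover idea can be sketched with a stdlib-only micro-benchmark. The chunked copy below is an illustration of the technique only; the chunking scheme and sizes are hypothetical and do not reflect the actual implementation in `arrow/io/memory.cc`:

```python
# Sketch: find the buffer size at which a chunked, threaded copy starts to
# beat a plain serial copy. Illustration only; not the Arrow implementation.
import time
from concurrent.futures import ThreadPoolExecutor


def serial_copy(src, dst):
    # Single memcpy-like slice assignment.
    dst[:] = src


def parallel_copy(src, dst, nthreads=4):
    # Split the buffer into nthreads contiguous chunks and copy each
    # chunk on its own thread.
    n = len(src)
    chunk = (n + nthreads - 1) // nthreads

    def copy_chunk(i):
        lo, hi = i * chunk, min((i + 1) * chunk, n)
        dst[lo:hi] = src[lo:hi]

    with ThreadPoolExecutor(nthreads) as pool:
        list(pool.map(copy_chunk, range(nthreads)))


def bench(fn, src, dst, repeats=3):
    # Report the best of a few runs to reduce timer noise.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(src, dst)
        best = min(best, time.perf_counter() - t0)
    return best


if __name__ == "__main__":
    for size in (1 << 20, 4 << 20, 16 << 20):  # 1 MB, 4 MB, 16 MB
        src = bytes(size)
        dst = bytearray(size)
        s = bench(serial_copy, src, dst)
        p = bench(parallel_copy, src, dst)
        print("%5d KB  serial %.4fs  parallel %.4fs" % (size >> 10, s, p))
```

Whichever size first shows the parallel copy winning is a candidate default threshold; the real crossover depends on thread-launch overhead and memory bandwidth on the target machine.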

cc @pcmoritz @robertnishihara

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #1017 from wesm/ARROW-1381 and squashes the following commits:

fbd0028 [Wes McKinney] Add unit test for SerializedPyObject.to_buffer
ab85230 [Wes McKinney] Add nthreads option for turning on multithreaded memcpy
db12072 [Wes McKinney] Use FixedSizeBufferWriter in SerializedPyObject.to_buffer


Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/f4500255
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/f4500255
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/f4500255

Branch: refs/heads/master
Commit: f4500255220fd1a4d68579e8b088fe3f315897de
Parents: d8c651c
Author: Wes McKinney <wes.mckinney@twosigma.com>
Authored: Wed Aug 30 14:58:09 2017 -0700
Committer: Philipp Moritz <pcmoritz@gmail.com>
Committed: Wed Aug 30 14:58:09 2017 -0700

----------------------------------------------------------------------
 python/pyarrow/serialization.pxi           | 9 ++++++---
 python/pyarrow/tests/test_serialization.py | 8 ++++++++
 2 files changed, 14 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow/blob/f4500255/python/pyarrow/serialization.pxi
----------------------------------------------------------------------
diff --git a/python/pyarrow/serialization.pxi b/python/pyarrow/serialization.pxi
index dc9bdaf..5c0fbc6 100644
--- a/python/pyarrow/serialization.pxi
+++ b/python/pyarrow/serialization.pxi
@@ -181,13 +181,16 @@ cdef class SerializedPyObject:
         # also unpack the list the object was wrapped in in serialize
         return PyObject_to_object(result)[0]
 
-    def to_buffer(self):
+    def to_buffer(self, nthreads=1):
         """
         Write serialized data as Buffer
         """
-        sink = BufferOutputStream()
+        cdef Buffer output = allocate_buffer(self.total_bytes)
+        sink = FixedSizeBufferWriter(output)
+        if nthreads > 1:
+            sink.set_memcopy_threads(nthreads)
         self.write_to(sink)
-        return sink.get_result()
+        return output
 
 
 def serialize(object value, SerializationContext context=None):

http://git-wip-us.apache.org/repos/asf/arrow/blob/f4500255/python/pyarrow/tests/test_serialization.py
----------------------------------------------------------------------
diff --git a/python/pyarrow/tests/test_serialization.py b/python/pyarrow/tests/test_serialization.py
index d922576..4e98bd5 100644
--- a/python/pyarrow/tests/test_serialization.py
+++ b/python/pyarrow/tests/test_serialization.py
@@ -237,6 +237,14 @@ def test_primitive_serialization(large_memory_map):
             serialization_roundtrip(obj, mmap)
 
 
+def test_serialize_to_buffer():
+    for nthreads in [1, 4]:
+        for value in COMPLEX_OBJECTS:
+            buf = pa.serialize(value).to_buffer(nthreads=nthreads)
+            result = pa.deserialize(buf)
+            assert_equal(value, result)
+
+
 def test_complex_serialization(large_memory_map):
     with pa.memory_map(large_memory_map, mode="r+") as mmap:
         for obj in COMPLEX_OBJECTS:

