From: pcmoritz@apache.org
To: commits@arrow.apache.org
Subject: arrow git commit: ARROW-1381: [Python] Use FixedSizeBufferWriter in SerializedPyObject.to_buffer
Date: Wed, 30 Aug 2017 21:58:35 +0000 (UTC)

Repository: arrow
Updated Branches: refs/heads/master d8c651ce2 -> f45002552

ARROW-1381: [Python] Use FixedSizeBufferWriter in SerializedPyObject.to_buffer

With this setup:

```
import numpy as np
import pyarrow as pa

objects = [np.random.randn(500, 500) for i in range(400)]
serialized = pa.serialize(objects)
```

I get, before:

```
In [3]: %timeit buf = serialized.to_buffer()
201 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

and after:

```
In [4]: %timeit buf = serialized.to_buffer()
81.1 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

I added an `nthreads` option, but note that when the objects are small, multithreading makes things slower due to the overhead of launching threads. The 1 MB threshold in `arrow/io/memory.cc` may be too small; we should do some benchmarking to find a better default crossover point for switching between parallel and serial memcpy:

```
In [2]: %timeit buf = serialized.to_buffer(nthreads=4)
134 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

cc @pcmoritz @robertnishihara

Author: Wes McKinney

Closes #1017 from wesm/ARROW-1381 and squashes the following commits:

fbd0028 [Wes McKinney] Add unit test for SerializedPyObject.to_buffer
ab85230 [Wes McKinney] Add nthreads option for turning on multithreaded memcpy
db12072 [Wes McKinney] Use FixedSizeBufferWriter in SerializedPyObject.to_buffer

Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/f4500255
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/f4500255
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/f4500255
Branch: refs/heads/master
Commit: f4500255220fd1a4d68579e8b088fe3f315897de
Parents: d8c651c
Author: Wes McKinney
Authored: Wed Aug 30 14:58:09 2017 -0700
Committer: Philipp Moritz
Committed: Wed Aug 30 14:58:09 2017 -0700

----------------------------------------------------------------------
 python/pyarrow/serialization.pxi           | 9 ++++++---
 python/pyarrow/tests/test_serialization.py | 8 ++++++++
 2 files changed, 14 insertions(+), 3 deletions(-)
----------------------------------------------------------------------
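[Editor's note] The essence of the change in the diff below is replacing a growable output stream with a writer over a buffer preallocated to `total_bytes`. A minimal plain-Python sketch of the two code paths follows; the class names mirror the pattern but are illustrative models, not the real pyarrow `BufferOutputStream`/`FixedSizeBufferWriter` implementations:

```python
# Illustrative sketch only: contrasts a growable output stream (old
# code path) with a fixed-size writer over a preallocated buffer
# (new code path). These are NOT the real pyarrow classes.

class GrowableStream:
    """Old path: the buffer grows (and may reallocate and copy) as data arrives."""
    def __init__(self):
        self._buf = bytearray()

    def write(self, data):
        self._buf.extend(data)  # may trigger reallocation plus copy

    def get_result(self):
        return bytes(self._buf)  # one more copy out at the end


class FixedSizeWriter:
    """New path: write directly into a buffer sized up front (total_bytes)."""
    def __init__(self, total_bytes):
        self._buf = bytearray(total_bytes)   # single allocation
        self._view = memoryview(self._buf)
        self._pos = 0

    def write(self, data):
        n = len(data)
        self._view[self._pos:self._pos + n] = data  # copy in place
        self._pos += n

    def get_result(self):
        return bytes(self._buf)


# Both writers produce identical output; the fixed-size path avoids
# intermediate reallocations because the total size is known in advance.
chunks = [b"a" * 10, b"b" * 20]
fixed = FixedSizeWriter(sum(len(c) for c in chunks))
grow = GrowableStream()
for c in chunks:
    fixed.write(c)
    grow.write(c)
assert fixed.get_result() == grow.get_result()
```

Knowing the serialized size ahead of time (as `SerializedPyObject.total_bytes` does) is what makes the preallocated path possible, and it also enables writing into the buffer from multiple threads, since each region's destination offset is fixed.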
http://git-wip-us.apache.org/repos/asf/arrow/blob/f4500255/python/pyarrow/serialization.pxi
----------------------------------------------------------------------
diff --git a/python/pyarrow/serialization.pxi b/python/pyarrow/serialization.pxi
index dc9bdaf..5c0fbc6 100644
--- a/python/pyarrow/serialization.pxi
+++ b/python/pyarrow/serialization.pxi
@@ -181,13 +181,16 @@ cdef class SerializedPyObject:
         # also unpack the list the object was wrapped in in serialize
         return PyObject_to_object(result)[0]
 
-    def to_buffer(self):
+    def to_buffer(self, nthreads=1):
         """
         Write serialized data as Buffer
         """
-        sink = BufferOutputStream()
+        cdef Buffer output = allocate_buffer(self.total_bytes)
+        sink = FixedSizeBufferWriter(output)
+        if nthreads > 1:
+            sink.set_memcopy_threads(nthreads)
         self.write_to(sink)
-        return sink.get_result()
+        return output
 
 
 def serialize(object value, SerializationContext context=None):

http://git-wip-us.apache.org/repos/asf/arrow/blob/f4500255/python/pyarrow/tests/test_serialization.py
----------------------------------------------------------------------
diff --git a/python/pyarrow/tests/test_serialization.py b/python/pyarrow/tests/test_serialization.py
index d922576..4e98bd5 100644
--- a/python/pyarrow/tests/test_serialization.py
+++ b/python/pyarrow/tests/test_serialization.py
@@ -237,6 +237,14 @@ def test_primitive_serialization(large_memory_map):
         serialization_roundtrip(obj, mmap)
 
 
+def test_serialize_to_buffer():
+    for nthreads in [1, 4]:
+        for value in COMPLEX_OBJECTS:
+            buf = pa.serialize(value).to_buffer(nthreads=nthreads)
+            result = pa.deserialize(buf)
+            assert_equal(value, result)
+
+
 def test_complex_serialization(large_memory_map):
     with pa.memory_map(large_memory_map, mode="r+") as mmap:
         for obj in COMPLEX_OBJECTS:
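[Editor's note] The parallel-vs-serial memcpy crossover discussed in the commit message can be sketched with a stdlib-only model. The chunking below stands in for what `set_memcopy_threads` enables and is a hypothetical illustration, not Arrow's actual C++ implementation (and in CPython the GIL limits any real speedup; this only models the chunk-splitting strategy):

```python
# Sketch: copy a source buffer into a destination either serially or
# split into per-thread slices. For small buffers the thread-launch
# overhead dominates; past some crossover size (Arrow uses a 1 MB
# threshold in arrow/io/memory.cc) a parallel copy can win.
from concurrent.futures import ThreadPoolExecutor


def serial_copy(dst, src):
    dst[:] = src


def parallel_copy(dst, src, nthreads=4):
    n = len(src)
    chunk = (n + nthreads - 1) // nthreads  # ceil-divide into slices
    def copy_slice(i):
        lo, hi = i * chunk, min((i + 1) * chunk, n)
        dst[lo:hi] = src[lo:hi]             # each thread owns one region
    with ThreadPoolExecutor(nthreads) as pool:
        list(pool.map(copy_slice, range(nthreads)))


src = bytes(range(256)) * 4096              # 1 MiB source buffer
dst_a = memoryview(bytearray(len(src)))
dst_b = memoryview(bytearray(len(src)))
serial_copy(dst_a, src)
parallel_copy(dst_b, src, nthreads=4)
assert bytes(dst_a) == src and bytes(dst_b) == src
```

Because the destination offsets are disjoint, no locking is needed between the copying threads; the only tunable is where the serial/parallel crossover point sits, which is exactly the benchmarking question raised above.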