arrow-dev mailing list archives

From Yevgeni Litvin <selit...@gmail.com>
Subject Table of tensors with Arrow
Date Tue, 23 Oct 2018 05:45:19 GMT
In Petastorm we operate on tables of tensors, and we are trying to map this
data structure onto Arrow's primitives. One way is to use a pa.array of
binary values (BinaryValue): we serialize each pa.Tensor into a binary value
with FixedSizeBufferWriter and deserialize it on read. This feels somewhat
awkward, and I guess it does not achieve zero-copy behavior(?)

This is what we do to deserialize the tensor from a single binary value:

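        # as_py() materializes the binary value as Python bytes (a copy)
        buffer = value.as_py()
        reader = pa.BufferReader(memoryview(buffer))
        tensor = pa.read_tensor(reader)
        # view the tensor as a numpy ndarray
        n = tensor.to_numpy()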


And this is how a numpy array is serialized into a BinaryValue that is
written to a Parquet store:

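        tensor = pa.Tensor.from_numpy(array)
        # allocate exactly the space the serialized tensor needs
        buffer = pa.allocate_buffer(pa.get_tensor_size(tensor))
        stream = pa.FixedSizeBufferWriter(buffer)
        pa.write_tensor(tensor, stream)
        # named `data` rather than `bytes` to avoid shadowing the builtin
        data = bytearray(buffer.to_pybytes())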

Is there a better, more Arrow-native approach to modeling our data?

Thanks!

- Yevgeni
