arrow-dev mailing list archives

From Emilio Lahr-Vivaz
Subject Re: buffer alignment (format/java/js)
Date Tue, 08 Aug 2017 16:35:11 GMT
So I think the issue is that we are serializing record batches in a 
distributed fashion, and then concatenating them in the streaming 
format. However, the message serialization only aligns the start of the 
buffers, which requires it to know the current absolute offset of the 
output stream. Would there be any problem with padding the end of the 
message, so any single serialized record batch would always be a 
multiple of 8 bytes?

I've put together a branch that does this, and the existing java tests 
all pass. I'm having some trouble running the integration tests though.
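
As a rough illustration of the padding idea (a TypeScript sketch, not the code 
from the branch and not any Arrow API; padMessage, paddedLength and 
startOffsets are hypothetical helper names): if every serialized message is 
padded to a multiple of 8 bytes, then independently serialized messages can be 
concatenated and each one still starts on an 8-byte boundary, without the 
writer knowing the absolute stream offset.

```typescript
// Hypothetical helpers for illustration only; not part of Arrow.
const ALIGNMENT = 8;

// Round a byte length up to the next multiple of `alignment`.
function paddedLength(length: number, alignment: number = ALIGNMENT): number {
  return Math.ceil(length / alignment) * alignment;
}

// Copy a serialized message into a zero-filled array whose length is aligned,
// so the trailing bytes act as padding.
function padMessage(message: Uint8Array, alignment: number = ALIGNMENT): Uint8Array {
  const out = new Uint8Array(paddedLength(message.length, alignment));
  out.set(message);
  return out;
}

// Offset at which each message would start once the messages are concatenated.
function startOffsets(messages: Uint8Array[]): number[] {
  const offsets: number[] = [];
  let offset = 0;
  for (const m of messages) {
    offsets.push(offset);
    offset += m.length;
  }
  return offsets;
}

// Three stand-in "messages" of 13, 24 and 5 bytes, serialized separately.
const padded = [13, 24, 5].map((n) => padMessage(new Uint8Array(n)));
const starts = startOffsets(padded);
console.log(padded.map((m) => m.length)); // [16, 24, 8]
console.log(starts);                      // [0, 16, 40]
```

This assumes the buffers inside each message are already aligned relative to 
the start of that message; the end padding only guarantees that the next 
message begins on an aligned offset.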



On 08/08/2017 09:18 AM, Emilio Lahr-Vivaz wrote:
> Hi Wes,
> You're right, I just realized that. I think the alignment issue might 
> be in some unrelated code, actually. From what I can tell the 
> arrow writers are aligning buffers correctly; if not I'll open a bug.
> Thanks,
> Emilio
> On 08/08/2017 09:15 AM, Wes McKinney wrote:
>> hi Emilio,
>> From your description, it isn't clear why 8-byte alignment is causing
>> a problem (as compared with 64-byte alignment). My understanding is
>> that the element sizes of JavaScript's TypedArray classes range from 1 to 8 bytes.
>> The starting offset for all buffers should be 8-byte aligned, if not
>> that is a bug. Could you clarify?
>> - Wes
>> On Tue, Aug 8, 2017 at 8:52 AM, Emilio Lahr-Vivaz wrote:
>>> After looking at it further, I think only the buffers themselves need to be
>>> aligned, not the metadata and/or schema. Would there be any problem with
>>> changing the alignment to 64 bytes then?
>>> Thanks,
>>> Emilio
>>> On 08/08/2017 08:08 AM, Emilio Lahr-Vivaz wrote:
>>>> I'm looking into buffer alignment in the java writer classes. Currently
>>>> some files written with the java streaming writer can't be read due to the
>>>> javascript TypedArray's restriction that the start offset of the array must
>>>> be a multiple of the data size of the array type (i.e. Int32Vectors must
>>>> start on a multiple of 4, Float64Vectors must start on a multiple of 8,
>>>> etc). From a cursory look at the java writer, I believe that the schema that
>>>> is written first is not aligned at all, and then each record batch pads out
>>>> its size to a multiple of 8. So:
>>>> 1. should the schema block pad itself so that the first record batch is
>>>> aligned, and is there any problem with doing so?
>>>> 2. is there any problem with changing the alignment to 64 bytes, as
>>>> recommended (but not required) by the spec?
>>>> Thanks,
>>>> Emilio
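
For reference, the TypedArray restriction mentioned in the original message 
can be reproduced with plain JavaScript typed arrays. The following is a 
minimal TypeScript sketch using only the standard ArrayBuffer and TypedArray 
constructors; the variable names are illustrative and no Arrow code is 
involved.

```typescript
const backing = new ArrayBuffer(64);

// Offsets that are a multiple of the element size are accepted.
const ints = new Int32Array(backing, 12, 2);      // 12 is a multiple of 4
const doubles = new Float64Array(backing, 16, 2); // 16 is a multiple of 8

// The same offset of 12 is rejected for 8-byte elements: the constructor
// throws a RangeError because 12 is not a multiple of 8.
let misalignedError: unknown = null;
try {
  new Float64Array(backing, 12, 2);
} catch (e) {
  misalignedError = e;
}
console.log(misalignedError instanceof RangeError); // true
```

Since the largest TypedArray element size is 8 bytes, a buffer whose absolute 
start offset is a multiple of 8 satisfies this check for every element type.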
