arrow-dev mailing list archives

From: Wes McKinney <wesmck...@gmail.com>
Subject: Re: RecordBatch.length vs. Buffer.length?
Date: Mon, 06 May 2019 13:47:17 GMT

hi Jeffrey,

The size of each Buffer can vary significantly depending on the
schema. For example, Binary and List have variable element sizes, so
the sizes of their buffers vary as well.
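
To make that concrete, here is a rough sketch of the arithmetic in plain
Java. It does not use the Arrow library: the class and method names are
made up for illustration, and the sizes ignore the padding/alignment that
the IPC format adds. It assumes 32-bit offsets for variable-width columns
and one validity bit per row.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Illustrative only: models unpadded Arrow buffer sizes by hand (hypothetical helper). */
final class BufferSizeSketch {
    /** Validity bitmap: one bit per row, rounded up to whole bytes. */
    static long validityBytes(int rowCount) {
        return (rowCount + 7L) / 8L;
    }

    /** Data buffer of a fixed-width column, e.g. 4 bytes per value for Int32. */
    static long fixedWidthDataBytes(int rowCount, int bytesPerValue) {
        return (long) rowCount * bytesPerValue;
    }

    /** Offsets buffer of a variable-width column: rowCount + 1 offsets of 4 bytes each. */
    static long offsetsBytes(int rowCount) {
        return (rowCount + 1L) * 4L;
    }

    /** Data buffer of a Utf8 column: total encoded bytes, not a function of row count. */
    static long utf8DataBytes(List<String> values) {
        long total = 0;
        for (String v : values) {
            total += v.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> names = List.of("a", "bb", "a much longer string value");
        int rows = names.size();
        System.out.println("rows             = " + rows);
        System.out.println("int32 data bytes = " + fixedWidthDataBytes(rows, 4));
        System.out.println("validity bytes   = " + validityBytes(rows));
        System.out.println("utf8 offsets     = " + offsetsBytes(rows));
        System.out.println("utf8 data bytes  = " + utf8DataBytes(names));
    }
}
```

Only the fixed-width data buffer is "row count times element width". The
offsets and validity buffers follow different formulas, and the Utf8 data
buffer depends on the values themselves, which is why each buffer carries
its own byte length.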

I'm not sure about the exact details in the Java library, but there
should be some integrity check that the vectors belonging to a record
batch all have the same length. If there is not, and it is possible to
send invalid record batches with the IPC protocol, can you please open
a JIRA issue? We have a RecordBatch::Validate method in C++ to check
for this:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.cc#L224
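
On the consumer side, the kind of check I have in mind could look roughly
like the sketch below. This is not the Java library's API and not a port
of the C++ method: the class and method names are hypothetical, and it
assumes you have already obtained each column's value count (here passed
in as a plain map) along with the batch's row count.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a hand-rolled length check, not part of any Arrow library. */
final class BatchLengthCheck {
    /**
     * Returns one message per column whose value count differs from the
     * batch row count. An empty list means the lengths are consistent.
     */
    static List<String> findLengthMismatches(long batchRowCount,
                                             Map<String, Long> columnValueCounts) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Long> column : columnValueCounts.entrySet()) {
            if (column.getValue() != batchRowCount) {
                problems.add("column '" + column.getKey() + "' has "
                        + column.getValue() + " values, expected " + batchRowCount);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // The inconsistent case from the original question:
        // row count 8, with 100 and 300 ints in the two columns.
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("first", 100L);
        counts.put("second", 300L);
        for (String problem : findLengthMismatches(8, counts)) {
            System.out.println(problem);
        }
    }
}
```

A reader that runs a check like this before touching the buffers can
reject the batch up front instead of silently reading only the first 8
values of each column.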

Thanks

On Fri, May 3, 2019 at 9:56 PM Jeffrey Green <jeffrey.n.green@gmail.com> wrote:
>
> Hello!
>
> I'm using the Java API for Arrow and am finding some ambiguity between the
> length field in a RecordBatch and the "byte-width-adjusted" length field in
> a Buffer.
>
> As per https://arrow.apache.org/docs/format/Metadata.html under the "Record
> data headers" section:
> "A record batch is a collection of top-level named, equal length Arrow
> arrays (or vectors)."
>
> This seems to correspond to org.apache.arrow.flatbuf.RecordBatch.length()
> when reading and VectorSchemaRoot.setRowCount() when writing.  In addition
> to this field, each array buffer has its own specific length in bytes.
>
> As a library developer (particularly on the consumer side), what is the
> proper behavior when these two numbers don't match or when array lengths
> don't match each other?  For example, I can use the ArrowFileWriter to
> create a two-column file where I setRowCount to 8, add 100 ints to the
> first column and 300 ints to the second column and everything seems to
> "work" fine even though this doesn't seem to be internally consistent.
>
> If these various length fields are supposed to correspond to each other /
> represent the same thing, then having two different accounts of the same
> value seems error-prone and ambiguous.  Why does the format not exclusively
> use RecordBatch.length combined with each array's bitWidth?  The product of
> the two seems like it should be equivalent to Buffer.length.
>
> As such, I think I must be missing something and am looking for more
> clarity on how to think about and process RecordBatch.length and
> Buffer.length (once I divide by bytesPerElement).
>
> Thanks.
