arrow-dev mailing list archives

From Jeffrey Green <jeffrey.n.gr...@gmail.com>
Subject RecordBatch.length vs. Buffer.length?
Date Sat, 04 May 2019 02:55:46 GMT
Hello!

I'm using the Java API for Arrow and am finding some ambiguity between the
length field in a RecordBatch and the "byte-width-adjusted" length field in
a Buffer.

As per https://arrow.apache.org/docs/format/Metadata.html under the "Record
data headers" section:
"A record batch is a collection of top-level named, equal length Arrow
arrays (or vectors)."

This seems to correspond to org.apache.arrow.flatbuf.RecordBatch.length()
when reading and VectorSchemaRoot.setRowCount() when writing.  In addition
to this field, each array buffer has its own specific length in bytes.

As a library developer (particularly on the consumer side), what is the
proper behavior when these two numbers don't match or when array lengths
don't match each other?  For example, I can use the ArrowFileWriter to
create a two-column file where I setRowCount to 8, add 100 ints to the
first column and 300 ints to the second column and everything seems to
"work" fine even though this doesn't seem to be internally consistent.

If these various length fields are supposed to correspond to each other /
represent the same thing, then having two different accounts of the same
value seems error-prone and ambiguous.  Why does the format not exclusively
use RecordBatch.length combined with each array's bitWidth?  The product of
the two, divided by 8 to convert bits to bytes, seems like it should be
equivalent to Buffer.length.
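
To make the arithmetic concrete, here is a minimal sketch of the check a
consumer might run.  BufferLengthCheck and its methods are hypothetical
names and are not part of the Arrow Java API.  The sketch assumes
fixed-width values only, and it compares with >= rather than == on the
assumption that a buffer may legitimately be longer than the data it holds,
since the Arrow format recommends padding buffers to a multiple of 8 bytes.

```java
// Hypothetical helper for illustration; not part of the Arrow Java API.
final class BufferLengthCheck {
    private BufferLengthCheck() {}

    /** Minimum number of bytes needed to hold rowCount values of bitWidth bits each. */
    static long minBytes(long rowCount, int bitWidth) {
        if (rowCount < 0 || bitWidth <= 0) {
            throw new IllegalArgumentException("rowCount must be >= 0 and bitWidth must be > 0");
        }
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        long bits = Math.multiplyExact(rowCount, (long) bitWidth);
        // Round up to whole bytes (matters for sub-byte widths such as 1-bit values).
        return bits / 8 + (bits % 8 == 0 ? 0 : 1);
    }

    /**
     * True if a data buffer of bufferBytes bytes is large enough for rowCount values.
     * Assumption: extra trailing bytes (padding) are allowed, so this is >=, not ==.
     */
    static boolean isConsistent(long rowCount, int bitWidth, long bufferBytes) {
        return bufferBytes >= minBytes(rowCount, bitWidth);
    }
}
```

With a row count of 8 and 32-bit ints, minBytes(8, 32) is 32.  A buffer
holding 100 ints (400 bytes) passes this check because it is long enough,
so a check of this shape would only catch buffers that are too short, not
columns that carry more values than the declared row count.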

As such, I think I must be missing something and am looking for more
clarity on how to think about and process RecordBatch.length and
Buffer.length (once I divide by bytesPerElement).

Thanks.
