drill-dev mailing list archives

From Jason Altekruse <altekruseja...@gmail.com>
Subject [DISCUSS] Renaming the RecordBatch interface
Date Sat, 06 Dec 2014 01:33:39 GMT
Hello Drillers,

I am currently working on trying to write documentation to describe our
current interface and implementation patterns used in RecordBatch and its
subclasses. These classes currently contain the implementations of all of
our physical operators; subclasses include FilterRecordBatch, HashAggBatch,
etc.

This naming convention has been a point of confusion for many developers as
they get up to speed on Drill and begin to piece together the control flow
of a query. The name "RecordBatch" implies that the class is logically a
data structure that holds a batch of records.

During execution, each downstream operator (which implements the RecordBatch
interface) will be able to access all of the data in the current batches
(the actual data structure) from the operator(s) immediately preceding it.
In this sense, calling this class a RecordBatch is not entirely inaccurate,
as it is providing a reference into the current data.

The place where it gets confusing is that it does not just hold data. Each
RecordBatch has a next() method, which is used to retrieve the next batch
of records (the data structure). The way this is possible is that the data
is shared with consumers of the interface in the form of a vector container
object, which wraps value vectors. A call to next() will swap out the data
in the vector containers with new data.
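
To make the pattern above concrete, here is a rough sketch in Java. It is
not Drill's actual code: Container, Operator, Source, EvenFilter and drain
are hypothetical names made up for illustration, next() is simplified to
return a boolean, and plain int arrays stand in for value vectors. The only
point is that a consumer fetches the container reference once, and each
call to next() replaces the data behind that reference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class RecordBatchSketch {

    /** Hypothetical stand-in for a vector container: its contents are replaced in place. */
    static final class Container {
        private int[] values = new int[0];
        int[] values() { return values; }
        void replace(int[] newValues) { values = newValues; }
    }

    /** Hypothetical, simplified operator interface: it exposes data AND advances it. */
    interface Operator {
        Container container(); // stable reference, handed out once
        boolean next();        // loads the next batch into the container; false when exhausted
    }

    /** Leaf operator that feeds pre-built batches into its container. */
    static final class Source implements Operator {
        private final Container out = new Container();
        private final Iterator<int[]> batches;
        Source(List<int[]> batches) { this.batches = batches.iterator(); }
        public Container container() { return out; }
        public boolean next() {
            if (!batches.hasNext()) return false;
            out.replace(batches.next());
            return true;
        }
    }

    /** Downstream operator: reads the upstream container, keeps only even values. */
    static final class EvenFilter implements Operator {
        private final Operator upstream;
        private final Container in; // same object for the whole run
        private final Container out = new Container();
        EvenFilter(Operator upstream) {
            this.upstream = upstream;
            this.in = upstream.container();
        }
        public Container container() { return out; }
        public boolean next() {
            if (!upstream.next()) return false;
            out.replace(Arrays.stream(in.values()).filter(v -> v % 2 == 0).toArray());
            return true;
        }
    }

    /** Consumer loop: the container is fetched once, its contents change on every next(). */
    static List<String> drain(Operator op) {
        Container c = op.container();
        List<String> seen = new ArrayList<>();
        while (op.next()) {
            seen.add(Arrays.toString(c.values()));
        }
        return seen;
    }

    public static void main(String[] args) {
        Operator root = new EvenFilter(
                new Source(List.of(new int[]{1, 2, 3, 4}, new int[]{5, 6})));
        System.out.println(drain(root)); // prints [[2, 4], [6]]
    }
}
```

In this sketch the Operator interface plays the role that RecordBatch plays
today: it looks like a holder of data, but it behaves like something that is
advanced over time.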

I was talking with a few members of the dev team about this problem and we
were all in agreement that the interface and its implementations should be
renamed. We tried to talk further about the overall model and decided that
some refactoring/encapsulation may come along with this renaming as we
clarify these concepts.

I would like to begin this discussion by proposing our
candidates for new names for the interface. The three that stood out for us
were BatchIterator, BatchStream, and BatchCursor. These all represent a
logical wrapper around data that will be accessed by a consumer over time,
and will be accessed in discrete chunks at some level. Each has existing
conventions that define them, and some might be more appropriate than
others for the current implementation used by Drill.

Please share your thoughts on the best possible new name for RecordBatch.

Thanks,
Jason
