drill-issues mailing list archives

From "Jinfeng Ni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5376) Rationalize Drill's row structure for simpler code, better performance
Date Fri, 24 Mar 2017 15:06:41 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940507#comment-15940507 ]

Jinfeng Ni commented on DRILL-5376:

If that's your intention, does it make sense to change the title of the JIRA to reflect that?
Otherwise, people may get confused and think you are proposing a fundamental change in Drill's
execution by adopting a row-based approach to data representation.

> Rationalize Drill's row structure for simpler code, better performance
> ----------------------------------------------------------------------
>                 Key: DRILL-5376
>                 URL: https://issues.apache.org/jira/browse/DRILL-5376
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
> Drill is a columnar system, but data is ultimately represented as rows (AKA records or
tuples.) The way that Drill represents rows leads to excessive code complexity and runtime
> Data in Drill is stored in vectors: one (or more) per column. Vectors do not stand alone,
however; they are "bundled" into various forms of grouping: the {{VectorContainer}}, {{RecordBatch}},
{{VectorAccessible}}, {{VectorAccessibleSerializable}}, and more. Each has slightly different
semantics, requiring large amounts of code to bridge between the representations.
> Consider only a simple row: one with only scalar columns. In classic relational theory,
such a row is a tuple:
> {code}
> R = (a, b, c, d, ...)
> {code}
> A tuple is defined as an ordered list of column values. Unlike a list or array, the column
values also have names and may have varying data types.
> In SQL, columns are referenced by either position or name. In most execution engines,
columns are referenced by position (since positions, in most systems, cannot change.) A 1:1
mapping is provided between names and positions. (See the JDBC {{ResultSet}} interface.)
> This allows code to be very fast: code references columns by index, not by name, avoiding
name lookups for each column reference.
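
The name-to-position mapping described above can be sketched as follows. This is a minimal illustration, not Drill or JDBC code: the class {{PositionalRow}} and its methods are hypothetical names chosen for the example. The idea is that a name is resolved to an index once, and every later access uses the index.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not a Drill or JDBC class. */
final class PositionalRow {
    // 1:1 mapping from column name to column position, built once.
    private final Map<String, Integer> indexByName = new HashMap<>();
    private final Object[] values;

    PositionalRow(List<String> names, Object[] values) {
        if (names.size() != values.length) {
            throw new IllegalArgumentException("names and values differ in length");
        }
        for (int i = 0; i < names.size(); i++) {
            indexByName.put(names.get(i), i);
        }
        this.values = values.clone();
    }

    /** Name lookup: intended to be called once per column, outside hot loops. */
    int indexOf(String name) {
        Integer index = indexByName.get(name);
        if (index == null) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return index;
    }

    /** Positional access: a plain array read, no hashing or string compare. */
    Object get(int index) {
        return values[index];
    }
}
```

A caller would typically call {{indexOf("b")}} once during setup and then use {{get(index)}} for each row processed.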
> Drill provides a murky, hybrid approach. Some structures ({{BatchSchema}}, for example)
appear to provide a fixed column ordering, allowing indexed column access. But, other abstractions
provide only an iterator. Others (such as {{VectorContainer}}) provide name-based access
or, by clever programming, indexed access.
> As a result, it is never clear exactly how to quickly access a column: by name, by name
to multi-part index to vector?
> Of course, Drill also supports maps, which add to the complexity. First, we must understand
that a "map" in Drill is not a "map" in the classic sense: it is not a collection of (name,
value) pairs in the JSON sense: a collection in which each instance may have a different set
of pairs.
> Instead, in Drill, a "map" is really a nested tuple: a map has the same structure as
a Drill record: a collection of names and values in which all rows have the same structure.
(This is so because maps are really a collection of value vectors, and the vectors cut across
all rows.)
> Drill, however, does not reflect this symmetry: that a row and a map are both tuples.
There are no common abstractions for the two. Instead, maps are represented as a {{MapVector}}
that contains a (name, vector) map for its children.
> Because of this name-based mapping, high-speed indexed access to vectors is not provided
"out of the box." Certainly each consumer of a map can build its own indexing mechanism. But,
this leads to code complexity and redundancy.
> This ticket asks to rationalize Drill's row, map and schema abstractions around the tuple
concept. A schema is a description of a tuple and should (as in JDBC) provide both name and
index based access. That is, provide methods of the form:
> {code}
> MaterializedField getField(int index);
> MaterializedField getField(String name);
> ...
> ValueVector getVector(int index);
> ValueVector getVector(String name);
> {code}
> Provide a common abstraction for rows and maps, recognizing their structural similarity.
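
One possible shape for such a common abstraction is sketched below. It is an assumption-laden illustration, not Drill's API: {{Tuple}} and {{SimpleTuple}} are hypothetical names, and the type parameter {{V}} stands in for whatever vector type a column holds, so that the example compiles without Drill on the classpath. Both a row and a map would be instances of the same type, and a map column simply holds a nested tuple.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical tuple abstraction; V stands in for a column's vector type. */
interface Tuple<V> {
    int size();
    String name(int index);
    int indexOf(String name);   // returns -1 when the name is absent
    V getVector(int index);
    V getVector(String name);
}

/** Hypothetical implementation usable for both a row and a map. */
final class SimpleTuple<V> implements Tuple<V> {
    private final List<String> names = new ArrayList<>();
    private final List<V> vectors = new ArrayList<>();
    private final Map<String, Integer> indexByName = new HashMap<>();

    /** Appends a column; its index is its insertion position. */
    SimpleTuple<V> add(String name, V vector) {
        if (indexByName.containsKey(name)) {
            throw new IllegalArgumentException("Duplicate column: " + name);
        }
        indexByName.put(name, names.size());
        names.add(name);
        vectors.add(vector);
        return this;
    }

    @Override public int size() { return names.size(); }

    @Override public String name(int index) { return names.get(index); }

    @Override public int indexOf(String name) {
        Integer index = indexByName.get(name);
        return index == null ? -1 : index;
    }

    @Override public V getVector(int index) { return vectors.get(index); }

    @Override public V getVector(String name) {
        int index = indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return vectors.get(index);
    }
}
```

With this shape, code that walks a row and code that walks a map can share one implementation, which is the symmetry the description says is missing today.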
> There is an obvious issue with indexing columns in a row when the row contains maps.
Should indexing be multi-part (index into row, then into map) as today? A better alternative
is to provide a flattened interface:
> {code}
> 0: a, 1: b.x, 2: b.y, 3: c, ...
> {code}
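
The flattened numbering above can be produced by a depth-first walk over the schema. The sketch below is illustrative only: {{Column}} and {{FlatSchema}} are hypothetical classes, not Drill types, and the sketch assumes that a column with children is a map and that only leaf columns receive an index (an empty map is therefore treated as a leaf).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical schema node: a scalar has no children, a map has some. */
final class Column {
    final String name;
    final List<Column> children;

    Column(String name, Column... children) {
        this.name = name;
        this.children = Arrays.asList(children);
    }

    boolean isMap() {
        return !children.isEmpty();
    }
}

/** Hypothetical helper that assigns one flat index per leaf column. */
final class FlatSchema {
    /** Returns dotted leaf paths; a path's list position is its flat index. */
    static List<String> flatten(List<Column> columns) {
        List<String> out = new ArrayList<>();
        collect("", columns, out);
        return out;
    }

    // Depth-first walk: maps contribute their leaves, never themselves.
    private static void collect(String prefix, List<Column> columns, List<String> out) {
        for (Column column : columns) {
            String path = prefix.isEmpty() ? column.name : prefix + "." + column.name;
            if (column.isMap()) {
                collect(path, column.children, out);
            } else {
                out.add(path);
            }
        }
    }
}
```

For a row with columns {{a}}, map {{b}} (children {{x}}, {{y}}) and {{c}}, the walk yields the paths {{a}}, {{b.x}}, {{b.y}}, {{c}} at indexes 0 through 3.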
> Use this change to simplify client code, over time, to use a simple index-based column

This message was sent by Atlassian JIRA
