beam-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reuven Lax <re...@google.com>
Subject Re: [DISCUSS] Portability representation of schemas
Date Tue, 04 Jun 2019 20:21:28 GMT
On Tue, Jun 4, 2019 at 9:20 AM Brian Hulette <bhulette@google.com> wrote:

>
>
> On Mon, Jun 3, 2019 at 10:04 PM Reuven Lax <relax@google.com> wrote:
>
>>
>>
>> On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette <bhulette@google.com>
>> wrote:
>>
>>> > It has to go into the proto somewhere (since that's the only way the
>>> SDK can get it), but I'm not sure they should be considered integral parts
>>> of the type.
>>> Are you just advocating for an approach where any SDK-specific
>>> information is stored outside of the Schema message itself so that Schema
>>> really does just represent the type? That seems reasonable to me, and
>>> alleviates my concerns about how this applies to columnar encodings a bit
>>> as well.
>>>
>>
>> Yes, that's exactly what I'm advocating.
>>
>>
>>>
>>> We could lift all of the LogicalTypeConversion messages out of the
>>> Schema and the LogicalType like this:
>>>
>>> message SchemaCoder {
>>>   Schema schema = 1;
>>>   LogicalTypeConversion root_conversion = 2;
>>>   map<string, LogicalTypeConversion> attribute_conversions = 3; // only
>>> necessary for user type aliases, portable logical types by definition have
>>> nothing SDK-specific
>>> }
>>>
>>
>> I'm not sure what the map is for? I think we have status quo wihtout it.
>>
>
> My intention was that the SDK-specific information (to/from functions) for
> any nested fields that are themselves user type aliases would be stored in
> this map. That was the motivation for my next question, if we don't allow
> user types to be nested within other user types we may not need it.
>

Oh, is this meant to contain the ids of all the logical types in this
schema? If so I don't think SchemaCoder is the right place for this. Any
"registry" of logical types should be global to the pipeline, not scoped to
a single PCollection IMO.


> I may be missing your meaning - but I think we currently only have status
> quo without this map in the Java SDK because Schema.LogicalType is just an
> interface that must be implemented. It's appropriate for just portable
> logical types, not user-type aliases. Note I've adopted Kenn's terminology
> where portable logical type is a type that can be identified by just a URN
> and maybe some parameters, while a user type alias needs some SDK specific
> information, like a class and to/from UDFs.
>
>
>>
>>> I think a critical question (that has implications for the above
>>> proposal) is how/if the two different concepts Kenn mentioned are allowed
>>> to nest. For example, you could argue it's redundant to have a user type
>>> alias that has a Row representation with a field that is itself a user type
>>> alias, because instead you could just have a single top-level type alias
>>> with to/from functions that pack and unpack the entire hierarchy. On the
>>> other hand, I think it does make sense for a user type alias or a truly
>>> portable logical type to have a field that is itself a truly portable
>>> logical type (e.g. a user type alias or portable type with a DateTime).
>>>
>>> I've been assuming that user-type aliases could be nested, but should we
>>> disallow that? Or should we go the other way and require that logical types
>>> define at most one "level"?
>>>
>>
>> No I think it's useful to allow things to be nested (though of course the
>> nesting must terminate).
>>
>
>>
>>>
>>> Brian
>>>
>>> On Mon, Jun 3, 2019 at 11:08 AM Kenneth Knowles <kenn@apache.org> wrote:
>>>
>>>>
>>>> On Mon, Jun 3, 2019 at 10:53 AM Reuven Lax <relax@google.com> wrote:
>>>>
>>>>> So I feel a bit leery about making the to/from functions a fundamental
>>>>> part of the portability representation. In my mind, that is very tied to a
>>>>> specific SDK/language. A SDK (say the Java SDK) wants to allow users to use
>>>>> a wide variety of native types with schemas, and under the covers uses the
>>>>> to/from functions to implement that. However from the portable Beam
>>>>> perspective, the schema itself should be the real "type" of the
>>>>> PCollection; the to/from methods are simply a way that a particular SDK
>>>>> makes schemas easier to use. It has to go into the proto somewhere (since
>>>>> that's the only way the SDK can get it), but I'm not sure they should be
>>>>> considered integral parts of the type.
>>>>>
>>>>
>>>> On the doc in a couple places this distinction was made:
>>>>
>>>> * For truly portable logical types, no instructions for the SDK are
>>>> needed. Instead, they require:
>>>>    - URN: a standardized identifier any SDK can recognize
>>>>    - A spec: what is the universe of values in this type?
>>>>    - A representation: how is it represented in built-in types? This is
>>>> how SDKs who do not know/care about the URN will process it
>>>>    - (optional): SDKs choose preferred SDK-specific types to embed the
>>>> values in. SDKs have to know about the URN and choose for themselves.
>>>>
>>>> *For user-level type aliases, written as convenience by the user in
>>>> their pipeline, what Java schemas have today:
>>>>    - to/from UDFs: the code is SDK-specific
>>>>    - some representation of the intended type (like java class): also
>>>> SDK specific
>>>>    - a representation
>>>>    - any "id" is just like other ids in the pipeline, just avoiding
>>>> duplicating the proto
>>>>    - Luke points out that nesting these can give multiple SDKs a hint
>>>>
>>>> In my mind the remaining complexity is whether or not we need to be
>>>> able to move between the two. Composite PTransforms, for example, do have
>>>> fluidity between being strictly user-defined versus portable URN+payload.
>>>> But it requires lots of engineering, namely the current work on expansion
>>>> service.
>>>>
>>>> Kenn
>>>>
>>>>
>>>>> On Mon, Jun 3, 2019 at 10:23 AM Brian Hulette <bhulette@google.com>
>>>>> wrote:
>>>>>
>>>>>> Ah I see, I didn't realize that. Then I suppose we'll need to/from
>>>>>> functions somewhere in the logical type conversion to preserve the current
>>>>>> behavior.
>>>>>>
>>>>>> I'm still a little hesitant to make these functions an explicit part
>>>>>> of LogicalTypeConversion for another reason. Down the road, schemas could
>>>>>> give us an avenue to use a batched columnar format (presumably arrow, but
>>>>>> of course others are possible). By making to/from an explicit part of
>>>>>> logical types we add some element-wise logic to a schema representation
>>>>>> that's otherwise ambivalent to element-wise vs. batched encodings.
>>>>>>
>>>>>> I suppose you could make an argument that to/from are only for
>>>>>> custom types. There will also be some set of well-known types identified
>>>>>> only by URN and some parameters, which could easily be translated to a
>>>>>> columnar format. We could just not support custom types fully if we add a
>>>>>> columnar encoding, or maybe add optional toBatch/fromBatch functions
>>>>>> when/if we get there.
>>>>>>
>>>>>> What about something like this that makes the two different types of
>>>>>> logical types explicit?
>>>>>>
>>>>>> // Describes a logical type and how to convert between it and its
>>>>>> representation (e.g. Row).
>>>>>> message LogicalTypeConversion {
>>>>>>   oneof conversion {
>>>>>>     message Standard standard = 1;
>>>>>>     message Custom custom = 2;
>>>>>>   }
>>>>>>
>>>>>>   message Standard {
>>>>>>     String urn = 1;
>>>>>>     repeated string args = 2; // could also be a map
>>>>>>   }
>>>>>>
>>>>>>   message Custom {
>>>>>>     FunctionSpec(?) toRepresentation = 1;
>>>>>>     FunctionSpec(?) fromRepresentation = 2;
>>>>>>     bytes type = 3; // e.g. serialized class for Java
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> And LogicalType and Schema become:
>>>>>>
>>>>>> message LogicalType {
>>>>>>   FieldType representation = 1;
>>>>>>   LogicalTypeConversion conversion = 2;
>>>>>> }
>>>>>>
>>>>>> message Schema {
>>>>>>   ...
>>>>>>   repeated Field fields = 1;
>>>>>>   LogicalTypeConversion conversion = 2; // implied that
>>>>>> representation is Row
>>>>>> }
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> On Sat, Jun 1, 2019 at 10:44 AM Reuven Lax <relax@google.com> wrote:
>>>>>>
>>>>>>> Keep in mind that right now the SchemaRegistry is only assumed to
>>>>>>> exist at graph-construction time, not at execution time; all information in
>>>>>>> the schema registry is embedded in the SchemaCoder, which is the only thing
>>>>>>> we keep around when the pipeline is actually running. We could look into
>>>>>>> changing this, but it would potentially be a very big change, and I do
>>>>>>> think we should start getting users actively using schemas soon.
>>>>>>>
>>>>>>> On Fri, May 31, 2019 at 3:40 PM Brian Hulette <bhulette@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> > Can you propose what the protos would look like in this case?
>>>>>>>> Right now LogicalType does not contain the to/from conversion functions in
>>>>>>>> the proto. Do you think we'll need to add these in?
>>>>>>>>
>>>>>>>> Maybe. Right now the proposed LogicalType message is pretty
>>>>>>>> simple/generic:
>>>>>>>> message LogicalType {
>>>>>>>>   FieldType representation = 1;
>>>>>>>>   string logical_urn = 2;
>>>>>>>>   bytes logical_payload = 3;
>>>>>>>> }
>>>>>>>>
>>>>>>>> If we keep just logical_urn and logical_payload, the
>>>>>>>> logical_payload could itself be a protobuf with attributes of 1) a
>>>>>>>> serialized class and 2/3) to/from functions. Or, alternatively, we could
>>>>>>>> have a generalization of the SchemaRegistry for logical types.
>>>>>>>> Implementations for standard types and user-defined types would be
>>>>>>>> registered by URN, and the SDK could look them up given just a URN. I put a
>>>>>>>> brief section about this alternative in the doc last week [1]. What I
>>>>>>>> suggested there included removing the logical_payload field, which is
>>>>>>>> probably overkill. The critical piece is just relying on a registry in the
>>>>>>>> SDK to look up types and to/from functions rather than storing them in the
>>>>>>>> portable schema itself.
>>>>>>>>
>>>>>>>> I kind of like keeping the LogicalType message generic for now,
>>>>>>>> since it gives us a way to try out these various approaches, but maybe
>>>>>>>> that's just a cop out.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.jlt5hdrolfy
>>>>>>>>
>>>>>>>> On Fri, May 31, 2019 at 12:36 PM Reuven Lax <relax@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, May 28, 2019 at 10:11 AM Brian Hulette <
>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, May 26, 2019 at 1:25 PM Reuven Lax <relax@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 24, 2019 at 11:42 AM Brian Hulette <
>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> *tl;dr:* SchemaCoder represents a logical type with a base
>>>>>>>>>>>> type of Row and we should think about that.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm a little concerned that the current proposals for a
>>>>>>>>>>>> portable representation don't actually fully represent Schemas. It seems to
>>>>>>>>>>>> me that the current java-only Schemas are made up three concepts that are
>>>>>>>>>>>> intertwined:
>>>>>>>>>>>> (a) The Java SDK specific code for schema inference, type
>>>>>>>>>>>> coercion, and "schema-aware" transforms.
>>>>>>>>>>>> (b) A RowCoder[1] that encodes Rows[2] which have a particular
>>>>>>>>>>>> Schema[3].
>>>>>>>>>>>> (c) A SchemaCoder[4] that has a RowCoder for a
>>>>>>>>>>>> particular schema, and functions for converting Rows with that schema
>>>>>>>>>>>> to/from a Java type T. Those functions and the RowCoder are then composed
>>>>>>>>>>>> to provider a Coder for the type T.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> RowCoder is currently just an internal implementation detail, it
>>>>>>>>>>> can be eliminated. SchemaCoder is the only thing that determines a schema
>>>>>>>>>>> today.
>>>>>>>>>>>
>>>>>>>>>> Why not keep it around? I think it would make sense to have a
>>>>>>>>>> RowCoder implementation in every SDK, as well as something like SchemaCoder
>>>>>>>>>> that defines a conversion from that SDK's "Row" to the language type.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The point is that from a programmer's perspective, there is
>>>>>>>>> nothing much special about Row. Any type can have a schema, and the only
>>>>>>>>> special thing about Row is that it's always guaranteed to exist. From that
>>>>>>>>> standpoint, Row is nearly an implementation detail. Today RowCoder is never
>>>>>>>>> set on _any_ PCollection, it's literally just used as a helper library, so
>>>>>>>>> there's no real need for it to exist as a "Coder."
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> We're not concerned with (a) at this time since that's specific
>>>>>>>>>>>> to the SDK, not the interface between them. My understanding is we just
>>>>>>>>>>>> want to define a portable representation for (b) and/or (c).
>>>>>>>>>>>>
>>>>>>>>>>>> What has been discussed so far is really just a portable
>>>>>>>>>>>> representation for (b), the RowCoder, since the discussion is only around
>>>>>>>>>>>> how to represent the schema itself and not the to/from functions.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Correct. The to/from functions are actually related to a). One
>>>>>>>>>>> of the big goals of schemas was that users should not be forced to operate
>>>>>>>>>>> on rows to get schemas. A user can create PCollection<MyRandomType> and as
>>>>>>>>>>> long as the SDK can infer a schema from MyRandomType, the user never needs
>>>>>>>>>>> to even see a Row object. The to/fromRow functions are what make this work
>>>>>>>>>>> today.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> One of the points I'd like to make is that this type coercion is
>>>>>>>>>> a useful concept on it's own, separate from schemas. It's especially useful
>>>>>>>>>> for a type that has a schema and is encoded by RowCoder since that can
>>>>>>>>>> represent many more types, but the type coercion doesn't have to be tied to
>>>>>>>>>> just schemas and RowCoder. We could also do type coercion for types that
>>>>>>>>>> are effectively wrappers around an integer or a string. It could just be a
>>>>>>>>>> general way to map language types to base types (i.e. types that we have a
>>>>>>>>>> coder for). Then it just becomes a general framework for extending coders
>>>>>>>>>> to represent more language types.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Let's not tie those conversations. Maybe a similar concept will
>>>>>>>>> hold true for general coders (or we might decide to get rid of coders in
>>>>>>>>> favor of schemas, in which case that becomes moot), but I don't think we
>>>>>>>>> should prematurely generalize.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> One of the outstanding questions for that schema representation
>>>>>>>>>>>> is how to represent logical types, which may or may not have some language
>>>>>>>>>>>> type in each SDK (the canonical example being a timsetamp type with seconds
>>>>>>>>>>>> and nanos and java.time.Instant). I think this question is critically
>>>>>>>>>>>> important, because (c), the SchemaCoder, is actually *defining a logical
>>>>>>>>>>>> type* with a language type T in the Java SDK. This becomes clear when you
>>>>>>>>>>>> compare SchemaCoder[4] to the Schema.LogicalType interface[5] - both
>>>>>>>>>>>> essentially have three attributes: a base type, and two functions for
>>>>>>>>>>>> converting to/from that base type. The only difference is for SchemaCoder
>>>>>>>>>>>> that base type must be a Row so it can be represented by a Schema alone,
>>>>>>>>>>>> while LogicalType can have any base type that can be represented by
>>>>>>>>>>>> FieldType, including a Row.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This is not true actually. SchemaCoder can have any base type,
>>>>>>>>>>> that's why (in Java) it's SchemaCoder<T>. This is why PCollection<T> can
>>>>>>>>>>> have a schema, even if T is not Row.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> I'm not sure I effectively communicated what I meant - When I
>>>>>>>>>> said SchemaCoder's "base type" I wasn't referring to T, I was referring to
>>>>>>>>>> the base FieldType, whose coder we use for this type. I meant "base type"
>>>>>>>>>> to be analogous to LogicalType's `getBaseType`, or what Kenn is suggesting
>>>>>>>>>> we call "representation" in the portable beam schemas doc. To define some
>>>>>>>>>> terms from my original message:
>>>>>>>>>> base type = an instance of FieldType, crucially this is something
>>>>>>>>>> that we have a coder for (be it VarIntCoder, Utf8Coder, RowCoder, ...)
>>>>>>>>>> language type (or "T", "type T", "logical type") = Some Java
>>>>>>>>>> class (or something analogous in the other SDKs) that we may or may not
>>>>>>>>>> have a coder for. It's possible to define functions for converting
>>>>>>>>>> instances of the language type to/from the base type.
>>>>>>>>>>
>>>>>>>>>> I was just trying to make the case that SchemaCoder is really a
>>>>>>>>>> special case of LogicalType, where `getBaseType` always returns a Row with
>>>>>>>>>> the stored Schema.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yeah, I think  I got that point.
>>>>>>>>>
>>>>>>>>> Can you propose what the protos would look like in this case?
>>>>>>>>> Right now LogicalType does not contain the to/from conversion functions in
>>>>>>>>> the proto. Do you think we'll need to add these in?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> To make the point with code: SchemaCoder<T> can be made to
>>>>>>>>>> implement Schema.LogicalType<T,Row> with trivial implementations of
>>>>>>>>>> getBaseType, toBaseType, and toInputType (I'm not trying to say we should
>>>>>>>>>> or shouldn't do this, just using it illustrate my point):
>>>>>>>>>>
>>>>>>>>>> class SchemaCoder extends CustomCoder<T> implements
>>>>>>>>>> Schema.LogicalType<T, Row> {
>>>>>>>>>>   ...
>>>>>>>>>>
>>>>>>>>>>   @Override
>>>>>>>>>>   FieldType getBaseType() {
>>>>>>>>>>     return FieldType.row(getSchema());
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>>   @Override
>>>>>>>>>>   public Row toBaseType() {
>>>>>>>>>>     return this.toRowFunction.apply(input);
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>>   @Override
>>>>>>>>>>   public T toInputType(Row base) {
>>>>>>>>>>     return this.fromRowFunction.apply(base);
>>>>>>>>>>   }
>>>>>>>>>>   ...
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> I think it may make sense to fully embrace this duality, by
>>>>>>>>>>>> letting SchemaCoder have a baseType other than just Row and renaming it to
>>>>>>>>>>>> LogicalTypeCoder/LanguageTypeCoder. The current Java SDK schema-aware
>>>>>>>>>>>> transforms (a) would operate only on LogicalTypeCoders with a Row base
>>>>>>>>>>>> type. Perhaps some of the current schema logic could  alsobe applied more
>>>>>>>>>>>> generally to any logical type  - for example, to provide type coercion for
>>>>>>>>>>>> logical types with a base type other than Row, like int64 and a timestamp
>>>>>>>>>>>> class backed by millis, or fixed size bytes and a UUID class. And having a
>>>>>>>>>>>> portable representation that represents those (non Row backed) logical
>>>>>>>>>>>> types with some URN would also allow us to pass them to other languages
>>>>>>>>>>>> without unnecessarily wrapping them in a Row in order to use SchemaCoder.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think the actual overlap here is between the to/from functions
>>>>>>>>>>> in SchemaCoder (which is what allows SchemaCoder<T> where T != Row) and the
>>>>>>>>>>> equivalent functionality in LogicalType. However making all of schemas
>>>>>>>>>>> simply just a logical type feels a bit awkward and circular to me. Maybe we
>>>>>>>>>>> should refactor that part out into a LogicalTypeConversion proto, and
>>>>>>>>>>> reference that from both LogicalType and from SchemaCoder?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> LogicalType is already potentially circular though. A schema can
>>>>>>>>>> have a field with a logical type, and that logical type can have a base
>>>>>>>>>> type of Row with a field with a logical type (and on and on...). To me it
>>>>>>>>>> seems elegant, not awkward, to recognize that SchemaCoder is just a special
>>>>>>>>>> case of this concept.
>>>>>>>>>>
>>>>>>>>>> Something like the LogicalTypeConversion proto would definitely
>>>>>>>>>> be an improvement, but I would still prefer just using a top-level logical
>>>>>>>>>> type :)
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I've added a section to the doc [6] to propose this alternative
>>>>>>>>>>>> in the context of the portable representation but I wanted to bring it up
>>>>>>>>>>>> here as well to solicit feedback.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41
>>>>>>>>>>>> [2]
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59
>>>>>>>>>>>> [3]
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48
>>>>>>>>>>>> [4]
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33
>>>>>>>>>>>> [5]
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489
>>>>>>>>>>>> [6]
>>>>>>>>>>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 10, 2019 at 9:16 AM Brian Hulette <
>>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ah thanks! I added some language there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *From: *Kenneth Knowles <kenn@apache.org>
>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 5:31 PM
>>>>>>>>>>>>> *To: *dev
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> *From: *Brian Hulette <bhulette@google.com>
>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 2:02 PM
>>>>>>>>>>>>>> *To: * <dev@beam.apache.org>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We briefly discussed using arrow schemas in place of beam
>>>>>>>>>>>>>>> schemas entirely in an arrow thread [1]. The biggest reason not to this was
>>>>>>>>>>>>>>> that we wanted to have a type for large iterables in beam schemas. But
>>>>>>>>>>>>>>> given that large iterables aren't currently implemented, beam schemas look
>>>>>>>>>>>>>>> very similar to arrow schemas.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think it makes sense to take inspiration from arrow
>>>>>>>>>>>>>>> schemas where possible, and maybe even copy them outright. Arrow already
>>>>>>>>>>>>>>> has a portable (flatbuffers) schema representation [2], and implementations
>>>>>>>>>>>>>>> for it in many languages that we may be able to re-use as we bring schemas
>>>>>>>>>>>>>>> to more SDKs (the project has Python and Go implementations). There are a
>>>>>>>>>>>>>>> couple of concepts in Arrow schemas that are specific for the format and
>>>>>>>>>>>>>>> wouldn't make sense for us, (fields can indicate whether or not they are
>>>>>>>>>>>>>>> dictionary encoded, and the schema has an endianness field), but if you
>>>>>>>>>>>>>>> drop those concepts the arrow spec looks pretty similar to the beam proto
>>>>>>>>>>>>>>> spec.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> FWIW I left a blank section in the doc for filling out what
>>>>>>>>>>>>>> the differences are and why, and conversely what the interop opportunities
>>>>>>>>>>>>>> may be. Such sections are some of my favorite sections of design docs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kenn
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E
>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *From: *Robert Bradshaw <robertwb@google.com>
>>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 1:38 PM
>>>>>>>>>>>>>>> *To: *dev
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: Reuven Lax <relax@google.com>
>>>>>>>>>>>>>>>> Date: Thu, May 9, 2019 at 7:29 PM
>>>>>>>>>>>>>>>> To: dev
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > Also in the future we might be able to do optimizations
>>>>>>>>>>>>>>>> at the runner level if at the portability layer we understood schemes
>>>>>>>>>>>>>>>> instead of just raw coders. This could be things like only parsing a subset
>>>>>>>>>>>>>>>> of a row (if we know only a few fields are accessed) or using a columnar
>>>>>>>>>>>>>>>> data structure like Arrow to encode batches of rows across portability.
>>>>>>>>>>>>>>>> This doesn't affect data semantics of course, but having a richer,
>>>>>>>>>>>>>>>> more-expressive type system opens up other opportunities.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But we could do all of that with a RowCoder we understood
>>>>>>>>>>>>>>>> to designate
>>>>>>>>>>>>>>>> the type(s), right?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <
>>>>>>>>>>>>>>>> robertwb@google.com> wrote:
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> On the flip side, Schemas are equivalent to the space of
>>>>>>>>>>>>>>>> Coders with
>>>>>>>>>>>>>>>> >> the addition of a RowCoder and the ability to
>>>>>>>>>>>>>>>> materialize to something
>>>>>>>>>>>>>>>> >> other than bytes, right? (Perhaps I'm missing something
>>>>>>>>>>>>>>>> big here...)
>>>>>>>>>>>>>>>> >> This may make a backwards-compatible transition easier.
>>>>>>>>>>>>>>>> (SDK-side, the
>>>>>>>>>>>>>>>> >> ability to reason about and operate on such types is of
>>>>>>>>>>>>>>>> course much
>>>>>>>>>>>>>>>> >> richer than anything Coders offer right now.)
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> From: Reuven Lax <relax@google.com>
>>>>>>>>>>>>>>>> >> Date: Thu, May 9, 2019 at 4:52 PM
>>>>>>>>>>>>>>>> >> To: dev
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> > FYI I can imagine a world in which we have no coders.
>>>>>>>>>>>>>>>> We could define the entire model on top of schemas. Today's "Coder" is
>>>>>>>>>>>>>>>> completely equivalent to a single-field schema with a logical-type field
>>>>>>>>>>>>>>>> (actually the latter is slightly more expressive as you aren't forced to
>>>>>>>>>>>>>>>> serialize into bytes).
>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>> >> > Due to compatibility constraints and the effort that
>>>>>>>>>>>>>>>> would be  involved in such a change, I think the practical decision should
>>>>>>>>>>>>>>>> be for schemas and coders to coexist for the time being. However when we
>>>>>>>>>>>>>>>> start planning Beam 3.0, deprecating coders is something I would like to
>>>>>>>>>>>>>>>> suggest.
>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <
>>>>>>>>>>>>>>>> robertwb@google.com> wrote:
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> From: Kenneth Knowles <kenn@apache.org>
>>>>>>>>>>>>>>>> >> >> Date: Thu, May 9, 2019 at 10:05 AM
>>>>>>>>>>>>>>>> >> >> To: dev
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> > This is a huge development. Top posting because I
>>>>>>>>>>>>>>>> can be more compact.
>>>>>>>>>>>>>>>> >> >> >
>>>>>>>>>>>>>>>> >> >> > I really think after the initial idea converges
>>>>>>>>>>>>>>>> this needs a design doc with goals and alternatives. It is an
>>>>>>>>>>>>>>>> extraordinarily consequential model change. So in the spirit of doing the
>>>>>>>>>>>>>>>> work / bias towards action, I created a quick draft at
>>>>>>>>>>>>>>>> https://s.apache.org/beam-schemas and added everyone on
>>>>>>>>>>>>>>>> this thread as editors. I am still in the process of writing this to match
>>>>>>>>>>>>>>>> the thread.
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> Thanks! Added some comments there.
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> > *Multiple timestamp resolutions*: you can use
>>>>>>>>>>>>>>>> logcial types to represent nanos the same way Java and proto do.
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> As per the other discussion, I'm unsure the value in
>>>>>>>>>>>>>>>> supporting
>>>>>>>>>>>>>>>> >> >> multiple timestamp resolutions is high enough to
>>>>>>>>>>>>>>>> outweigh the cost.
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> > *Why multiple int types?* The domain of values for
>>>>>>>>>>>>>>>> these types are different. For a language with one "int" or "number" type,
>>>>>>>>>>>>>>>> that's another domain of values.
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> What is the value in having different domains? If
>>>>>>>>>>>>>>>> your data has a
>>>>>>>>>>>>>>>> >> >> natural domain, chances are it doesn't line up
>>>>>>>>>>>>>>>> exactly with one of
>>>>>>>>>>>>>>>> >> >> these. I guess it's for languages whose types have
>>>>>>>>>>>>>>>> specific domains?
>>>>>>>>>>>>>>>> >> >> (There's also compactness in representation, encoded
>>>>>>>>>>>>>>>> and in-memory,
>>>>>>>>>>>>>>>> >> >> though I'm not sure that's high.)
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> > *Columnar/Arrow*: making sure we unlock the ability
>>>>>>>>>>>>>>>> to take this path is Paramount. So tying it directly to a row-oriented
>>>>>>>>>>>>>>>> coder seems counterproductive.
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> I don't think Coders are necessarily row-oriented.
>>>>>>>>>>>>>>>> They are, however,
>>>>>>>>>>>>>>>> >> >> bytes-oriented. (Perhaps they need not be.) There
>>>>>>>>>>>>>>>> seems to be a lot of
>>>>>>>>>>>>>>>> >> >> overlap between what Coders express in terms of
>>>>>>>>>>>>>>>> element typing
>>>>>>>>>>>>>>>> >> >> information and what Schemas express, and I'd rather
>>>>>>>>>>>>>>>> have one concept
>>>>>>>>>>>>>>>> >> >> if possible. Or have a clear division of
>>>>>>>>>>>>>>>> responsibilities.
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> > *Multimap*: what does it add over an array-valued
>>>>>>>>>>>>>>>> map or large-iterable-valued map? (honest question, not rhetorical)
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> Multimap has a different notion of what it means to
>>>>>>>>>>>>>>>> contain a value,
>>>>>>>>>>>>>>>> >> >> can handle (unordered) unions of non-disjoint keys,
>>>>>>>>>>>>>>>> etc. Maybe this
>>>>>>>>>>>>>>>> >> >> isn't worth a new primitive type.
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> > *URN/enum for type names*: I see the case for both.
>>>>>>>>>>>>>>>> The core types are fundamental enough they should never really change -
>>>>>>>>>>>>>>>> after all, proto, thrift, avro, arrow, have addressed this (not to mention
>>>>>>>>>>>>>>>> most programming languages). Maybe additions once every few years. I prefer
>>>>>>>>>>>>>>>> the smallest intersection of these schema languages. A oneof is more clear,
>>>>>>>>>>>>>>>> while URN emphasizes the similarity of built-in and logical types.
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> Hmm... Do we have any examples of the multi-level
>>>>>>>>>>>>>>>> primitive/logical
>>>>>>>>>>>>>>>> >> >> type in any of these other systems? I have a bias
>>>>>>>>>>>>>>>> towards all types
>>>>>>>>>>>>>>>> >> >> being on the same footing unless there is compelling
>>>>>>>>>>>>>>>> reason to divide
>>>>>>>>>>>>>>>> >> >> things into primitive/use-defined ones.
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> Here it seems like the most essential value of the
>>>>>>>>>>>>>>>> primitive type set
>>>>>>>>>>>>>>>> >> >> is to describe the underlying representation, for
>>>>>>>>>>>>>>>> encoding elements in
>>>>>>>>>>>>>>>> >> >> a variety of ways (notably columnar, but also
>>>>>>>>>>>>>>>> interfacing with other
>>>>>>>>>>>>>>>> >> >> external systems like IOs). Perhaps, rather than the
>>>>>>>>>>>>>>>> previous
>>>>>>>>>>>>>>>> >> >> suggestion of making everything a logical of bytes,
>>>>>>>>>>>>>>>> this could be made
>>>>>>>>>>>>>>>> >> >> clear by still making everything a logical type, but
>>>>>>>>>>>>>>>> renaming
>>>>>>>>>>>>>>>> >> >> "TypeName" to Representation. There would be URNs
>>>>>>>>>>>>>>>> (typically with
>>>>>>>>>>>>>>>> >> >> empty payloads) for the various primitive types
>>>>>>>>>>>>>>>> (whose mapping to
>>>>>>>>>>>>>>>> >> >> their representations would be the identity).
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> - Robert
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>

Mime
View raw message