samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From prateekm <...@git.apache.org>
Subject [GitHub] samza pull request #293: Support for declaring serdes in high level API code...
Date Fri, 08 Sep 2017 21:54:11 GMT
GitHub user prateekm opened a pull request:

    https://github.com/apache/samza/pull/293

    Support for declaring serdes in high level API code for input/output/intermediate streams.

    @nickpan47 @vjagadish1989 @jmakes Please take a look.
    
    Some notes/considerations:
    * Serde interface is now Serializable. I've updated the built-in serde implementations
to make sure they are Serializable, but custom serde implementations will need to be updated
by users when they upgrade.
    * I've tried to preserve the number of serde objects and their usage before and after
serialization, but don't guarantee that any references shared by these serde instances before
serialization will be shared after deserialization. Ideally, serde of serdes should be done
in one pass along with all other user provided objects in the DAG so that any shared references
can be preserved.
    * I've left them as-is for now, but we probably need to move Serdes to samza-api now that
users can interact with them directly.
    * Samza's JsonSerde uses Samza's custom ObjectMapper since it was probably meant to be
an Samza internal serde. One peculiarity is that it forces the 'dasherized-names' property
naming convention instead of the (default in Jackson and more common) camelCase naming convention.
This makes user POJO definitions more verbose with JsonProperty annotations, and if users
don't remember to dasherize property names the resulting errors are somewhat confusing to
debug. Should we create a new JsonSerde with a default ObjectMapper when we move it to samza-api?
    * Note that the default value for a 'defaultKey/MsgSerde` is a no-op serde instead of
null. This helps users of Systems that do the serialization/deserialization themselves (e.g.
Wikipedia example), but the downside is that it is possible to forget setting the default
serde before creating streams and end up with a no-op serde and ClassCastExceptions at runtime.
    * I've made a choice here to disallow setting system/stream serdes in config to make configs
simpler. Imho the 'serdes only in code' and 'only 2 levels of serde' model is much simpler
to understand. The risk is that there might be streams that aren't part of the user application
code, but need their serdes to be set in configuration. An example is the logging topic. Do
you think this is a reasonable choice?
    * There's a separate discussion to be had about introducing a KV type and specifying serdes
for such KV types instead of specifying 'key' and 'msg' serdes separately.
    
    Tested with samza-hello-world Wikipedia application and all TestRepartitionWindowApp.
Will add unit tests if the overall design is acceptable.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/prateekm/samza serde-instance

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/samza/pull/293.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #293
    
----
commit 0d399607064faecc983526c255c0b4037ae84aa0
Author: Prateek Maheshwari <pmaheshw@linkedin.com>
Date:   2017-09-08T21:27:58Z

    Adding support for declaring serdes in high level API code for input/output/intermediate
stream.

----


---

Mime
View raw message