streams-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steve Blackmon <>
Subject [UPDATE] Namespacing of configuration inputs to components
Date Wed, 14 Feb 2018 19:29:47 GMT
TL/DR; another PR that wraps up the streams-config upgrade

I realized when refactoring some of the examples working on STREAMS-582:
Apply canonical classname name-spacing to all modules (, that the way the
examples define and resolve a full-pipeline configuration ought to benefit
from this pattern as well, so the branch contains a few revisions to
ComponentConfigurator, and an enhancement to StreamsConfigurator.

I have created a ticket and pushed a branch that adds this ability - in a
stand-alone commit to make it easier to review.

Now you can instantiate a typed StreamsConfigurator using any bean that
extends StreamsConfiguration.  The static methods on StreamsConfigurator
are unchanged, but in addition we can now resolve a pipeline configuration
(like those used in the examples) with each property fully resolved (based
on the class defined) using the new
ComponentConfigurator.detectConfiguration logic.

The only aspect of this change that might be breaking is that
StreamsConfiguration is now additionalProperties: false.  I think this is
appropriate, but it will require anyone who has been pulling custom details
through that way to instead define a class that extends
StreamsConfiguration, and use a typed StreamsConfigurator, thus getting
access to all of the new name spacing + fallback logic. I'm thinking we
should probably add additionalProperties: false at the root of all
configuration object json schemas, for similar reasons.

Additionally, a comprehensive ( and large ) PR is open for STREAMS-582:
Apply canonical classname name-spacing to all modules ( - it builds upon
STREAMS-583 then fixes up all classes and tests to use
ComponentConfigurator.detectConfiguration() and all examples to use
StreamsConfigurator.detectCustomConfiguration(), leaving all of the legacy
hard-coded keys removed and every unit and integration tests passing.

This turned out to be a big effort with many files touched but IMO makes
stream configuration more flexible, easier to understand, and would require
less boilerplate and glue code for simple, intermediate, and advanced
use-cases.  I’d like to write a confluence blog post about the new
streams-config that would explain how it works and why, and be easier to
follow than this email thread has been.


On Feb 6, 2018 at 12:31 PM, Steve Blackmon <> wrote:

I realize it would useful to pick up configuration details from all
ancestor classes as well, so I added that as well.

So the set of sources from which a configuration value can be sourced are
(in preference order)

1) full canonical name of class
i.e. org.apache.streams.twitter.config.TwitterEngagersProviderConfiguration

2) simple name of class i.e. just TwitterEngagersProviderConfiguration

3) any ancestor Class, closest ancestors preferred,
i.e.  org.apache.streams.twitter.config.TwitterTimelineProviderConfiguration,
then  org.apache.streams.twitter.config.TwitterUserInformationConfiguration,

(java.lang.Object is immune from this treatment)

4) the package of the Class, i.e. org.apache.streams.twitter.config

5) all parent packages of the Class, closest packages preferred, i.e.
org.apache.streams.twitter, then
org.apache.streams, then
org.apache, then

I look forward to getting some eyes, comments, and +1’s on this work, as it
will unlock some in-progress features and make configuration of code build
with streams far more flexible going forward!


On Feb 1, 2018 at 12:06 PM, Steve Blackmon <> wrote:

I think I’ve come up with a very nice solution to this problem.

- Add a new method to ComponentConfigurator - detectConfiguration()
- When the caller does not provide a Config or a path to
detectConfiguration, get fancy:
- Search for each of the fields declared by the component POJO class on
each of the following:
- the SimpleClassName
- the CanonicalClassName
- each ancestor package of the CanonicalClassName, longest to shortest
- if a field is specified at more than one package/class level, the class
or longest package ancestor takes precedence.

I created STREAMS-580 and submitted a PR:

The unit tests appear to be working as described above - please check it
out and let me know if anything looks off.

P.S. Note that after merging this capability, we’d still need to migrate
each component that currently hard-codes its configuration path to adopt
this method instead.

On Jan 31, 2018 at 7:47 PM, Steve Blackmon <> wrote:

TL;DR I think we should start aligning the default JVM config path with the
package/class of the code being configured

Hello All,

I’ve been working on two providers (TwitterEngagersProvider and
InstagramEngagersProvider) which share significant code with other

This is sort of complex but here’s the nature of the implementation:
a) Instantiate and run provider A, which generates user objects or user ids
b) Instantiate and run provider B, using the output of provider A as the
c) All of this happens in provider C, which then transforms the output of
provider B, exploding each item into a list as it’s own output.

Due mostly to conventions that have been in the project for a while,
providers A, B, and C all expect to find their configuration at the same
place in the JVM properties - either ‘twitter’ or ‘instagram’.
Additionally, providers A, B, and C all expect to be configured with a
max_items parameter.

So you can probably see the challenge - configuring provider C with
max_items = 1000 winds up giving unhelpful direction to providers A and B
about how much work they should do.  We don’t need or want 1000 user ids,
we really want 1000 engager ids which can typically be derived from ~50
user ids.

While on the surface its nice to be able to reuse a single conf file to run
a variety of providers, I see now that creates problems when running code
that integrates a variety of components that expect to be configured from
overlapping paths.  It’s possible to get around these problems by writing
glue code, but it seems like it would be smarter to have all classes by
default source their configuration using a simple and obvious namespace
strategy based on package/class.

So instead of every twitter provider picking up:
twitter.max_items = 1000

They would all be configured separately like:
org.apache.streams.twitter.providers.ProviderA.max_items = 10
org.apache.streams.twitter.providers.ProviderB.max_items = 100
org.apache.streams.twitter.providers.ProviderC.max_items = 1000

Since there are other configuration properties that are required but do not
differ between the three providers, a more typical configuration file would
really look like:
org.apache.streams.twitter.Twitter = { oauth = { … } }
org.apache.streams.twitter.providers.ProviderA =
org.apache.streams.twitter.providers.ProviderA.max_items = 10
org.apache.streams.twitter.providers.ProviderB =
org.apache.streams.twitter.providers.ProviderB.max_items = 100
org.apache.streams.twitter.providers.ProviderC =
org.apache.streams.twitter.providers.ProviderC.max_items = 1000

This would be a breaking refactoring, targeted for the 0.6.0 release.

I’m interested in others feedback on the idea in general, and
specifically.  For example whether fully qualified class name or just
simple name is preferable as a prefix.  And whether the class name used
should be that of the configuration bean i.e. TwitterConfiguration or of
the class that operates on it i.e. Twitter

Thanks for reading,

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message