samza-dev mailing list archives

From Ben Kirwin <...@kirw.in>
Subject Re: Coast, and a few implementation questions
Date Thu, 06 Nov 2014 18:06:50 GMT
Thanks for the interest! Responses inline.

> # Do you plan to automate the Samza config generation portion, so that you
> can run the full flow from a single command?

Yes, definitely. The config generation is just there so I could keep
the existing workflow mostly unchanged, but it's easy to swap out.
(I'm being intentionally noncommittal about this for now.) My hacky
config-writing code is here:

https://github.com/bkirwi/incubator-samza-hello-samza/blob/hello-coast/samza-wikipedia/src/main/scala/samza/examples/coast/ConfigGenerationApp.scala#L52-L62

If you replace those lines with `val job = jobFactory.getJob(config);
job.submit()` or similar, you should get a single-command launcher. I
haven't done it with YARN yet, but my integration tests work like
this.

> # What can we do to help?

Fielding these questions has already been a big help; thanks! My goal
right now is to get something that works on the existing 0.8 branch;
once that's done, it should be more obvious if there's anything Samza
can do to make things cleaner / more performant.

> Samza, itself, will handle exactly-once messaging at some point, but it
> depends on yet-to-be-implemented Kafka features. Are you saying that
> you're trying to implement exactly-once messaging on top of Samza?

Yes, that's right.

> If so, yes, this will be pretty tricky, and likely impact performance. I believe
> Trident's exactly-once performance suffers for this reason, as well. We
> looked at doing things this way, but opted for the Kafka-based approach.

I don't think this is as crazy as it may seem. A Kafka log is a
decent coordination primitive, and performance is generally very
good: unlike the Trident / Kafka-transactional approach, many jobs
require no extra I/O on the hot path and no access to a central
coordinator. There's definitely a cognitive overhead to doing this
manually, but I think coast has a pretty good shot at making that
easier. It has quite a lot of 'structural' knowledge about the flow
of data, so it should be able to do a pretty good job of inserting
the necessary checks / checkpoints / etc. one DAG node at a time.
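
To make the "log as coordination primitive" idea concrete, here is a
minimal sketch. It is not coast's actual implementation, only an
illustration under stated assumptions: `Log` is a hypothetical
in-memory stand-in for a single-partition Kafka topic, and `Tagged`,
`run` and `maxSteps` are names invented for this example. Each output
message carries the offset of the input that produced it, so a
restarted task can find its position by reading the tail of its own
output log, with no extra write on the hot path.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a single-partition Kafka topic:
// an append-only, in-memory log addressed by offset.
final class Log[A] {
  private val entries = ArrayBuffer.empty[A]
  def append(a: A): Unit = entries += a
  def read(offset: Int): Option[A] = entries.lift(offset)
  def size: Int = entries.size
  def last: Option[A] = entries.lastOption
  def toList: List[A] = entries.toList
}

// Every output records the input offset that produced it.
final case class Tagged[B](inputOffset: Int, value: B)

object ExactlyOnceSketch {
  // Applies `f` to each input message, resuming from wherever the
  // output log says the previous run stopped. `maxSteps` simulates a
  // crash after that many messages (an assumption for the demo).
  def run[A, B](input: Log[A], output: Log[Tagged[B]], maxSteps: Int)(f: A => B): Unit = {
    // Recovery: the last output tells us the last input offset handled.
    var next = output.last.map(_.inputOffset + 1).getOrElse(0)
    var steps = 0
    while (steps < maxSteps && next < input.size) {
      input.read(next).foreach(a => output.append(Tagged(next, f(a))))
      next += 1
      steps += 1
    }
  }

  def main(args: Array[String]): Unit = {
    val input = new Log[Int]
    (1 to 5).foreach(input.append)
    val output = new Log[Tagged[Int]]

    run(input, output, maxSteps = 2)(_ * 10)            // "crash" after two messages
    run(input, output, maxSteps = Int.MaxValue)(_ * 10) // restart and finish

    println(output.toList.map(_.value)) // List(10, 20, 30, 40, 50)
  }
}
```

This only covers one-to-one transformations, and it assumes each
append is atomic. A node that can drop messages (a filter, say)
writes nothing for some inputs, so it would need some other record
of its position.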

The toughest part so far has actually been dealing with the Kafka
ecosystem... a lot of tools try to hide the log abstraction from you
or bake in particular assumptions, which makes life hard when you need
precise control. One thing that pulled me to Samza is that it's pretty
explicit about the log model, so these things are closer to the
surface.

Anyways, I get that I'm swimming against the current here, but that's
the plan for now. Should be straightforward to switch to another
approach if things go badly...

>> Should I create the system myself from config during task init, or is
>> there a better way?
>
> Yea, for now, you'll have to create the system yourself. One thing that
> I'd been batting around was the idea of exposing all of the objects via
> the SamzaContainerContext, but that's far off.

Great, that works. Having access to these things in the context would
be convenient, but it's certainly not a blocker.

>> Am I missing anything? If not, would it be worth adding support for this?
>
> You're not missing anything. This seems like a worth-while feature, and is
> similar to the TaskCoordinator.shutdown() work that Martin did to add a
> request scope. Can you open a JIRA for this?

Done. For those following along at home, the ticket is here:

https://issues.apache.org/jira/browse/SAMZA-459
