samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Riccomini <criccom...@linkedin.com.INVALID>
Subject Re: Coast, and a few implementation questions
Date Fri, 07 Nov 2014 16:47:35 GMT
Hey Ben,

> but I think coast actually has a pretty good shot at making that easier
>-- it has quite a lot of 'structural' knowledge about the flow of data,
>so it should be able to do a pretty good job of inserting the necessary
>checks / checkpoints / etc. one DAG node at a time.

True. Given that you know exactly what computation is going on, it seems
more tractable. I'm curious how you plan to implement exactly once. Do you
have any docs?

> One thing that pulled me to Samza is that it's pretty explicit about the
>log model, so these things are closer to the surface.

Yes, that's one of our explicit goals. Samza's stream model is nearly
identical to Kafka's. It's very heavy weight, and expects repayable,
durable, fault tolerant, partitioned, ordered streams. Having these
expectations helps to make Samza, itself, a bit simpler. :)

Cheers,
Chris

On 11/6/14 10:06 AM, "Ben Kirwin" <ben@kirw.in> wrote:

>Thanks for the interest! Responses inline.
>
>> # Do you plan to automate the Samza config generation portion, so that
>>you
>> can run the full flow from a single command?
>
>Yes, definitely. The config generation is just there so I could keep
>the existing workflow mostly unchanged, but it's easy to swap out.
>(I'm being intentionally noncommittal about this for now.) My hacky
>config-writing code is here:
>
>https://github.com/bkirwi/incubator-samza-hello-samza/blob/hello-coast/sam
>za-wikipedia/src/main/scala/samza/examples/coast/ConfigGenerationApp.scala
>#L52-L62
>
>If you replace those lines with `val job = jobFactory.getJob(config);
>job.submit()` or similar, you should get a single-command launcher. I
>haven't done it with YARN yet, but my integration tests work like
>this.
>
>> # What can we do to help?
>
>Fielding these questions has already been a big help; thanks! My goal
>right now is to get something that works on the existing 0.8 branch;
>once that's done, it should be more obvious if there's anything Samza
>can do to make things cleaner / more performant.
>
>> Samza, itself, will handle exactly-once messaging at some point, but it
>> depends on yet-to-be-implemented Kafka features. Are you saying that
>> you're trying to implement exactly-once messaging on top of Samza?
>
>Yes, that's right.
>
>> If so, yes, this will be pretty tricky, and likely impact performance.
>>I believe
>> Trident's exactly-once performance suffers for this reason, as well. We
>> looked at doing things this way, but opted for the Kafka-based approach.
>
>I don't actually think this is as crazy as it may seem. A Kafka log is
>actually a decent coordination primitive, and performance is generally
>very good -- unlike the Trident / Kafka-transactional approach, many
>jobs require no extra I/O on the hot path or access to a central
>coordinator. There's definitely a cognitive overhead to doing this
>manually, but I think coast actually has a pretty good shot at making
>that easier -- it has quite a lot of 'structural' knowledge about the
>flow of data, so it should be able to do a pretty good job of
>inserting the necessary checks / checkpoints / etc. one DAG node at a
>time.
>
>The toughest part so far has actually been dealing with the Kafka
>ecosystem... a lot of tools try and hide the log abstraction from you
>or bake in particular assumptions, which makes life hard when you need
>precise control. One thing that pulled me to Samza is that it's pretty
>explicit about the log model, so these things are closer to the
>surface.
>
>Anyways, I get that I'm swimming against the current here, but that's
>the plan for now. Should be straightforward to switch to another
>approach if things go badly...
>
>>> Should I create the system myself from config during task init, or is
>>>there a better way?
>>
>> Yea, for now, you'll have to create the system yourself. One thing that
>> I'd been batting around was the idea of exposing all of the objects via
>> the SamzaContainerContext, but that's far off.
>
>Great, that works. Having access to these things in the context would
>be convenient, but it's certainly not a blocker.
>
>>> Am I missing anything? If not, would it be worth adding support for
>>>this?
>>
>> You're not missing anything. This seems like a worth-while feature, and
>>is
>> similar to the TaskCoordinator.shutdown() work that Martin did to add a
>> request scope. Can you open a JIRA for this?
>
>Done. For those following along at home, the ticket is here:
>
>https://issues.apache.org/jira/browse/SAMZA-459


Mime
View raw message