samza-dev mailing list archives

From David Pick <pickda...@gmail.com>
Subject Re: Question on table to table joins
Date Fri, 21 Nov 2014 18:57:18 GMT
Hey Martin,

Thanks for giving this so much thought!

We were also talking about a multi-stage solution similar to what you
suggested. While I think it would work, it would be a lot of overhead given
that we have a few hundred tables in our database (we don't care about them
all for this application, but a generic solution is appealing).

I'm also curious what Kafka topic layouts typically look like for other
people pushing database changes through it.

Is every table its own topic?
Is something like user_id denormalized onto every table so that they can
all be partitioned evenly?
How many topics can the average cluster handle? (If we end up with a lot of
Samza jobs that are all producing Kafka topics, does it fall over at some
point?)

Are there any good resources for that kind of information?

David

On Thu, Nov 20, 2014 at 6:48 PM, Martin Kleppmann <martin@kleppmann.com>
wrote:

> On 19 Nov 2014, at 20:21, David Pick <pickdavid@gmail.com> wrote:
> > Right now it's hundreds of gigs. As I mentioned before, we'd love to
> > partition our data better, but because we don't have a common join key on
> > every table, we haven't really come up with a good scheme for doing this.
>
> Your many-to-many join example is interesting. I think you could probably
> do it as a multi-stage pipeline. Say you want to generate a list of all the
> transactions for a particular customer. Your input streams are:
>
> * Customers, partitioned by customer_id
> * Customer_Transactions, partitioned by customer_id
> * Transactions, partitioned by transaction_id
>
> The data flow is:
>
> 1. Join Customers and Customer_Transactions on customer_id, and emit
> messages of the form (Customer, transaction_id), partitioned by
> transaction_id.
> 2. Join output of step 1 with Transactions on transaction_id, and emit
> messages of the form (Customer, Transaction), partitioned by customer_id.
> 3. Consume output of step 2, and group them by customer_id to get all the
> transactions for one customer in one place.
>
> That may seem a bit convoluted, but it might actually perform quite well.
> At least it will allow you to break the data into many partitions. (I'm
> hoping we can build higher-level tools in future which will abstract away
> this complexity.)
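
As a rough illustration of step 1 above, here is a minimal Samza StreamTask
sketch. The stream names, the local store name ("customers-store"), and the
message field names are hypothetical, and it glosses over what a real join
would need (buffering link records that arrive before their customer record,
and re-emitting when a customer record changes):

// Hypothetical step-1 join: Customers + Customer_Transactions on customer_id,
// re-keyed by transaction_id for the next stage. Names are illustrative only.
import java.util.Map;

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class CustomerTransactionJoinTask implements StreamTask, InitableTask {
  private static final SystemStream OUTPUT =
      new SystemStream("kafka", "customer-by-transaction");

  // Latest customer record seen for each customer_id, kept in local state.
  private KeyValueStore<String, Map<String, Object>> customers;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    customers = (KeyValueStore<String, Map<String, Object>>) context.getStore("customers-store");
  }

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    String customerId = (String) envelope.getKey();
    Map<String, Object> message = (Map<String, Object>) envelope.getMessage();

    if ("customers".equals(stream)) {
      // Remember the customer record so later link records can be joined.
      customers.put(customerId, message);
    } else if ("customer-transactions".equals(stream)) {
      Map<String, Object> customer = customers.get(customerId);
      if (customer != null) {
        String transactionId = (String) message.get("transaction_id");
        // Re-key by transaction_id so step 2 can co-partition with Transactions.
        collector.send(new OutgoingMessageEnvelope(OUTPUT, transactionId, transactionId, customer));
      }
      // A real job would also buffer link records that arrive before their
      // customer record, rather than silently dropping them.
    }
  }
}

The local store would be declared in the job config (stores.customers-store.*)
with a changelog stream so the join state survives restarts.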
>
> > This is great, though I think ideally what we would want is to have a
> kafka
> > topic per database table, and a Samza task that could consume all of one
> > topic for a table that's fairly small (e.g. customers) and a single
> > partition of a big table (e.g. posts).
>
> This may be even closer to what you need:
> https://issues.apache.org/jira/browse/SAMZA-402
>
> Best,
> Martin
>
>
