incubator-esme-dev mailing list archives

From Vassil Dichev <vdic...@apache.org>
Subject Re: Article: Enterprise microblogging needs a facelift to rival email
Date Mon, 19 Jul 2010 11:04:11 GMT
> The moon is pretty ominous looking in Hamburg tonight, which means
> that it's time for "contrary Ethan" to come out of hiding.

Hi there, "contrary Ethan" ;-)

> Really all I want to do is get anyone who might implement this to
> think twice about using tracking to do this. In fact, we probably need
> to take a really hard look at the tracking approach in general. Unless
> I'm mistaken, every message created is distributed to every user actor
> in the system to check if it matches a tracker for that user.

This is correct. So let's take a really hard look at the current
implementation and the alternative.

> Can anyone else verify that? If this is how it's really set up, then
> it's going to become a major problem in systems with large numbers of
> users and I'll need to open a Jira item for it and we'll figure out
> what to do. I'm not sure how to do it better right now, but if that's
> how it works then it's an issue.

If it's an issue, then it seems the only way to solve this would be to
disable tracking completely, just like Twitter did back in the day
when there was tracking via IM.

Let me also note that tracking is far more powerful than just
searching for a hashtag. There is a limited number of hashtags you can
find in a message, but in practice there is an unlimited number of
matchers you can construct on a message, because they use the same
filters as actions. This means that it's harder to find optimization
shortcuts for determining which existing tracking matchers match.

> With regards to hashtag following, I think the right way to do this is
> to set up another actor like a User actor or (to be created)
> Conversation actor and when someone follows a hashtag the actor will
> put the message on that user's timeline.
>
> Note that these conversation and hashtag actors only need to be
> started up when someone follows a conversation or hashtag, and there
> would be only one actor object per conversation/hashtag as opposed to
> one per user following (as in the case of tracking). This approach is
> also pretty efficient because we can look at a message and know
> exactly which conversation or hashtag actor we need to forward it to
> without querying 1000s of actors to see if they are interested in it.

Hm, I'm not sure how this would be more efficient. Instead of having
an actor per user which checks the track matchers per user, you'll
have an actor per track which checks the users for each track matcher.

The subtle problem here is with having unique track matchers as
opposed to unique users. When you have the UserActor-first approach,
you might have duplicate matches. In other words, you might scan for
the same tracking match again for another user. However, when you find
even one matcher that matches, you stop matching and append the
message to the user's timeline.
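
To make the UserActor-first behaviour concrete, here is a minimal
sketch in Python. It is not ESME's actual code: the Matcher type,
deliver_user_first and trackers_by_user are hypothetical names, and a
tracker is assumed to be nothing more than a predicate on the message
text.

```python
from typing import Callable, Dict, List

# Assumption: a tracker is modelled as a predicate on the message text.
Matcher = Callable[[str], bool]


def deliver_user_first(message: str,
                       trackers_by_user: Dict[str, List[Matcher]]) -> List[str]:
    """Return the users whose timeline receives the message."""
    recipients = []
    for user, matchers in trackers_by_user.items():
        # any() short-circuits: matching stops at the first matcher
        # that accepts the message, so a user is appended at most once.
        if any(matcher(message) for matcher in matchers):
            recipients.append(user)
    return recipients
```

Note that the same predicate may be evaluated again for another user;
that is the duplicate matching mentioned above.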

In contrast, when you go with the TrackingActor-first approach, you
might have duplicate user matches. This means that one message might
match a user multiple times for different tracking matches. Then, in
order to ensure uniqueness, one approach is to scan the mailbox every
time to check whether the message is already there. Another approach
would be to build a set of users to send the message to, which
guarantees that each user is included only once. This is identical in
complexity to sorting the list of users to send to.
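
A minimal sketch of the TrackingActor-first variant, under the same
assumptions as before (trackers are plain predicates; the names
deliver_tracker_first and followers_by_tracker are hypothetical and
not taken from ESME). The set here simply stands in for whatever
mechanism is used to guarantee uniqueness.

```python
from typing import Callable, List, Set, Tuple

# Assumption: a tracker is modelled as a predicate on the message text.
Matcher = Callable[[str], bool]


def deliver_tracker_first(
        message: str,
        followers_by_tracker: List[Tuple[Matcher, List[str]]]) -> Set[str]:
    """Return the unique users whose timeline receives the message."""
    recipients: Set[str] = set()
    for matcher, followers in followers_by_tracker:
        # Each tracker is evaluated once, but a user may follow several
        # trackers that all match, so recipients must be de-duplicated.
        if matcher(message):
            recipients.update(followers)
    return recipients
```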

Another point to keep in mind: in a typical scenario, do you expect
there to be more users or more hashtags?

Finally, having hashtag/conversation actors will not eliminate the
need for user actors, so they will just be added to the total actor
count.

There is no getting around the fact that if we want tracking, each and
every message has to be scanned for each and every user's track
criteria. Inverting the order doesn't reduce the overall
complexity much. Changing the current implementation might help if
there is an order of magnitude more users than track searches.
However, since a user can have many tracked searches, it seems that at
least the worst-case scenario will create many more actors and
messages sent for the model you're suggesting.
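
As a rough illustration of this trade-off, the toy count below
compares the two models for a single message that matches nothing
(the worst case for the user-first model). It is only a
back-of-the-envelope sketch: compare_models is a hypothetical helper,
trackers are represented by plain strings so that identical trackers
can be recognised, and one actor per distinct tracker is assumed for
the tracker-first model.

```python
from typing import Dict, List


def compare_models(trackers_by_user: Dict[str, List[str]]) -> Dict[str, int]:
    """Count matcher evaluations and actors for one non-matching message."""
    all_trackers = [t for ts in trackers_by_user.values() for t in ts]
    distinct = set(all_trackers)
    users = len(trackers_by_user)
    return {
        # User-first: every tracker of every user is evaluated.
        "user_first_checks": len(all_trackers),
        # Tracker-first: each distinct tracker is evaluated once.
        "tracker_first_checks": len(distinct),
        # User actors exist in both models; tracker actors are extra.
        "user_first_actors": users,
        "tracker_first_actors": users + len(distinct),
    }
```

When users share trackers, the tracker-first model saves evaluations;
when every tracker is unique, the number of evaluations is the same
and only the actor count grows.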

I'm not sure I explained this clearly enough; I can give more specific
examples and/or diagrams. Eventually it would be worth simulating both
models to see which one scales in practice.

Vassil
