incubator-esme-dev mailing list archives

From Ethan Jewett <esjew...@gmail.com>
Subject Re: Article: Enterprise microblogging needs a facelift to rival email
Date Mon, 19 Jul 2010 13:24:02 GMT
Hey Vassil,

I pretty much agree on all counts except the performance evaluation of
the two options, which is what this message is about after the second
paragraph.

I agree that tracks are extremely powerful, but I don't see how they
can scale. I'm definitely proposing something much less powerful, but
I think it will scale a lot better (see below). Twitter still does not
allow tracking on arbitrary text, so I guess they haven't been able to
crack this nut either. I don't think tracks are heavily used right now,
but if we introduce easy conversation and hashtag following I think
that feature will be heavily used, so I want to make sure we do it in a
way that will scale.

Just as an example of how our current track system and a more limited
system based on hashtags or conversations compare from a performance
perspective, let's say we have the following scenario:

1000 users on the system
10 users following the hashtag "#testtag"
1 message sent with text "Hello, I'm a #testtag"

What happens in the scenario where we use a Track of the "#testtag"
text is the following, as far as I can tell:

1. The message is sent to the Distributor
2. The Distributor sends the message to 1000 UserActor instances
3. Each UserActor evaluates the message (1000 evaluations - some more
expensive than others, depending on the # of tracks the user has set
up)
4. If the message matches (for 10 users), the message is added to the timeline

In this scenario we have 1000 attempts to match a message. With this
architecture we must be able to handle a number of match attempts per
minute given by NUM_MESSAGES_PER_MIN * NUM_USERS. If we have 10,000
users and 100 messages in a minute, that's 1,000,000 match attempts.
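
To make the fan-out concrete, here is a minimal Python sketch of the
track-based flow described above. It is a toy model, not ESME's actual
code: the UserActor class, the substring-based evaluate method, and the
helper names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class UserActor:
    """Hypothetical stand-in for a user actor with substring tracks."""
    name: str
    tracks: List[str] = field(default_factory=list)
    timeline: List[str] = field(default_factory=list)

    def evaluate(self, message: str) -> bool:
        # Assumption: a track matches when its text occurs in the message.
        if any(track in message for track in self.tracks):
            self.timeline.append(message)
            return True
        return False


def distribute_with_tracks(users: List[UserActor], message: str) -> Tuple[int, int]:
    """Send the message to every user; return (evaluations, matches)."""
    evaluations = 0
    matches = 0
    for user in users:
        evaluations += 1  # every user evaluates every message
        if user.evaluate(message):
            matches += 1
    return evaluations, matches


def match_attempts_per_minute(messages_per_min: int, num_users: int) -> int:
    """NUM_MESSAGES_PER_MIN * NUM_USERS from the text above."""
    return messages_per_min * num_users


# The scenario from the text: 1000 users, 10 of them tracking "#testtag".
users = [UserActor(f"user{i}") for i in range(1000)]
for user in users[:10]:
    user.tracks.append("#testtag")

track_result = distribute_with_tracks(users, "Hello, I'm a #testtag")
attempts = match_attempts_per_minute(100, 10_000)
```

In this model track_result is (1000, 10): a thousand evaluations to
deliver ten messages, and the evaluation count grows with the total
number of users regardless of how many of them track anything.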

In the (much more limited) scenario I'm proposing based on
HashTagActor and ConversationActor, the following happens:

1. The message is sent to the Distributor
2. The Distributor parses out the tags in the message - in this case
there is one tag, "#testtag"
3. The Distributor checks to see if there is a HashTagActor for
#testtag (there is in this case, because 10 people are following the
#testtag hashtag) and sends the message on to that actor.
4. The HashTagActor for #testtag tells each UserActor to add the
message to its mailbox.

In this scenario we have 1 operation to parse out the tags in the
message, one lookup for the HashTagActor (similar to or better in
complexity than the tracking match that each UserActor does) for each
tag in the message, and 10 addToMailbox calls (the same as in the
previous example). This number does not change if there are more users
in the system, only if there are more users following a particular
hashtag.
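
The hashtag-based flow above can be sketched the same way. This is
again a toy Python model under stated assumptions: HashTagRouter, its
follow and distribute methods, and the "#\w+" tag pattern are
hypothetical, and a user following two tags that appear in one message
would receive it twice here, since duplicate suppression is left out.

```python
import re
from collections import defaultdict

# Assumption: a hashtag is "#" followed by word characters.
TAG_PATTERN = re.compile(r"#\w+")


class HashTagRouter:
    """Hypothetical Distributor plus per-tag follower lists."""

    def __init__(self):
        self.followers = defaultdict(set)   # tag -> names of following users
        self.mailboxes = defaultdict(list)  # user name -> delivered messages

    def follow(self, user, tag):
        self.followers[tag].add(user)

    def distribute(self, message):
        """Route by tag; return the number of add-to-mailbox calls."""
        deliveries = 0
        # One parse of the message, then one lookup per distinct tag.
        for tag in set(TAG_PATTERN.findall(message)):
            # .get avoids creating entries for tags nobody follows.
            for user in self.followers.get(tag, ()):
                self.mailboxes[user].append(message)
                deliveries += 1
        return deliveries


# Same scenario: 1000 users exist, but only the 10 followers are touched.
router = HashTagRouter()
for i in range(10):
    router.follow(f"user{i}", "#testtag")

hashtag_deliveries = router.distribute("Hello, I'm a #testtag")
untagged_deliveries = router.distribute("Hello, no tags here")
```

Here hashtag_deliveries is 10 and untagged_deliveries is 0. The other
990 users never appear in the loop, so the work depends on the number
of followers of the tag rather than on the size of the system.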

Similar math applies for the conversation scenario. I've left out some
of the message passing that goes on, but it is the same in both
scenarios.

Note that the HashTagActor and ConversationActors would not even need
to be resident in main memory for this to be fairly fast. They could
be lazily constructed as needed.
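
One way to picture lazy construction is a registry that builds an actor
only on first lookup. The Python sketch below is an assumption-laden
illustration: LazyActorRegistry, HashTagActor, and the load_followers
callable (standing in for whatever persistent store holds the follower
lists) are hypothetical names, not part of ESME.

```python
class HashTagActor:
    """Hypothetical actor holding the followers of one tag."""

    def __init__(self, tag, followers):
        self.tag = tag
        self.followers = followers


class LazyActorRegistry:
    """Builds HashTagActor objects on first use instead of at startup."""

    def __init__(self, load_followers):
        # Assumption: load_followers(tag) returns the set of followers
        # from some persistent store, or an empty set if there are none.
        self._load_followers = load_followers
        self._actors = {}
        self.constructed = 0

    def lookup(self, tag):
        """Return the actor for tag, or None if nobody follows it."""
        actor = self._actors.get(tag)
        if actor is None:
            followers = self._load_followers(tag)
            if not followers:
                return None  # no followers, so no actor is created
            actor = HashTagActor(tag, followers)
            self._actors[tag] = actor
            self.constructed += 1
        return actor

    def evict(self, tag):
        """Drop an idle actor; it is rebuilt on the next lookup."""
        self._actors.pop(tag, None)


store = {"#testtag": {"alice", "bob"}}
registry = LazyActorRegistry(lambda tag: store.get(tag, set()))

first = registry.lookup("#testtag")
second = registry.lookup("#testtag")
missing = registry.lookup("#unfollowed")
```

The second lookup reuses the actor built by the first, and a tag with
no followers never gets an actor at all, so memory use tracks the tags
that are actually receiving messages.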

With regard to multiple messages ending up in the timeline, we
already have code in place in the UserActor.addToMailbox method that
will only add a message if it is not already in the Mailbox. I think
this is what stops the current tracking mechanism from adding messages
to the mailbox again when they already appear in my timeline.
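
As an illustration of that kind of guard, here is a small Python sketch.
It does not reproduce the real UserActor.addToMailbox code: the Mailbox
class, the add_to_mailbox method, and the use of a set of message ids
for the membership check are all assumptions.

```python
class Mailbox:
    """Hypothetical mailbox that ignores messages it already holds."""

    def __init__(self):
        self._messages = []     # (message_id, text) pairs in arrival order
        self._seen_ids = set()  # ids already present, for a fast check

    def add_to_mailbox(self, message_id, text):
        """Add the message once; return True only if it was new."""
        if message_id in self._seen_ids:
            return False
        self._seen_ids.add(message_id)
        self._messages.append((message_id, text))
        return True

    def messages(self):
        return list(self._messages)


mailbox = Mailbox()
# The same message arriving twice, e.g. via a hashtag and a conversation.
added_first = mailbox.add_to_mailbox(42, "Hello, I'm a #testtag")
added_again = mailbox.add_to_mailbox(42, "Hello, I'm a #testtag")
```

With a check like this, it does not matter how many routes deliver the
same message to a user: only the first call changes the mailbox.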

The fact is that we already have the Tracks implementation and pay its
overhead whenever a message is sent, so the incremental overhead of
reusing it for HashTag or Conversation tracking is fairly minimal
(similar to, but I think still more complex than, my alternative). What
I'm worried about is building on top of Tracking and then finding that
Tracking is what keeps us from scaling, at a point where we can't
remove it because core functionality has been built on top of it.

For this reason, I guess now is a good time to have this conversation! :-)

Ethan

On Mon, Jul 19, 2010 at 1:04 PM, Vassil Dichev <vdichev@apache.org> wrote:
>> The moon is pretty ominous looking in Hamburg tonight, which means
>> that it's time for "contrary Ethan" to come out of hiding.
>
> Hi there, "contrary Ethan" ;-)
>
>> Really all I want to do is get anyone who might implement this to
>> think twice about using tracking to do this. In fact, we probably need
>> to take a really hard look at the tracking approach in general. Unless
>> I'm mistaken, every message created is distributed to every user actor
>> in the system to check if it matches a tracker for that user.
>
> This is correct. So let's try to take a really hard look into the
> current implementation and the alternative.
>
>> Can anyone else verify that? If this is how it's really set up, then
>> it's going to become a major problem in systems with large numbers of
>> users and I'll need to open a Jira item for it and we'll figure out
>> what to do. I'm not sure how to do it better right now, but if that's
>> how it works then it's an issue.
>
> If it's an issue, then it seems the only way to solve this would be to
> disable tracking completely, just like Twitter did back in the days
> when there was tracking via IM.
>
> Let me also note that tracking is way more powerful than just search
> for a hashtag. There is a limited number of hashtags you can find in a
> message, but in practice there are an unlimited number of matchers you
> can construct on a message, because they use the same filters as
> actions. This means that it's harder to have optimization shortcuts
> for finding which existing tracking matchers match.
>
>> With regards to hashtag following, I think the right way to do this is
>> to set up another actor like a User actor or (to be created)
>> Conversation actor and when someone follows a hashtag the actor will
>> put the message on that user's timeline.
>>
>> Note that these conversation and hashtag actors only need to be
>> started up when someone follows a conversation or hashtag, and there
>> would be only one actor object per conversation/hashtag as opposed to
>> one per user following (as in the case of tracking). This approach is
>> also pretty efficient because we can look at a message and know
>> exactly which conversation or hashtag actor we need to forward it to
>> without querying 1000s of actors to see if they are interested in it.
>
> Hm, I'm not sure how this would be more efficient. Instead of having
> an actor per user which checks the track matchers per user, you'll
> have an actor per track which checks the users for each track matcher.
>
> The subtle problem here is with having unique track matchers as
> opposed to unique users. When you have the UserActor-first approach,
> you might have duplicate matches. In other words, you might scan for
> the same tracking match again for another user. However, when you find
> even one matcher that matches, you stop matching and append the
> message to the user's timeline.
>
> In contrast, when you go with the TrackingActor-first approach, you
> might have duplicate user matches. This means that one message might
> match a user multiple times for different tracking matches. Then in
> order to ensure uniqueness, one approach is that the mailbox must be
> scanned every time to see if the message is not already there. Another
> approach would be to have a set of users to send a message to, in
> which you guarantee that a user is only included once- this is
> identical in complexity to sorting the list of users to send to.
>
> Another point to have in mind is: in a typical scenario, do you expect
> users or hashtags to have a higher number?
>
> Finally, having hashtags/conversation actors will not eliminate the
> need for user actors, so they will be just added to the total actor
> count.
>
> There is no getting around the fact that if we want tracking, each and
> every message has to be scanned for each and every user's track
> criteria. Whether we invert the order doesn't reduce the overall
> complexity much. Changing the current implementation might help if
> there is an order of magnitude more users than track searches.
> However, since a user can have many tracked searches, it seems that at
> least the worst-case scenario will create many more actors and
> messages sent for the model you're suggesting.
>
> I'm not sure I explained this clearly enough, I can give more specific
> examples and/or diagrams. Eventually it's worth to simulate a model
> and see which one would scale in practice.
>
> Vassil
>
