incubator-esme-dev mailing list archives

From Vassil Dichev <vdic...@apache.org>
Subject Re: Article: Enterprise microblogging needs a facelift to rival email
Date Tue, 20 Jul 2010 05:41:40 GMT
> I pretty much agree on all counts except the performance evaluation of
> the two options, which is what this message is about after the second
> paragraph.

Of course, there's no disagreement, because we're describing different
things. I basically responded to your statement "we probably need to
take a really hard look at the tracking approach in general." I also
agree that if you're just looking for hashtags, you can optimize by
extracting them from each message and looking them up in a hash table
instead of iterating over all tracking matchers.
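
To make the hash-table shortcut concrete, here is a minimal sketch in
Python (not ESME's Scala code). The names followers_by_tag, follow,
extract_hashtags and recipients_for are hypothetical, and the hashtag
pattern is an assumption, not the parser ESME actually uses:

```python
import re
from collections import defaultdict

# Assumed hashtag syntax: '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")

# Hypothetical index: lowercased tag name -> set of user ids following it.
followers_by_tag = defaultdict(set)


def follow(user_id, tag):
    """Register user_id as a follower of tag (with or without the '#')."""
    followers_by_tag[tag.lstrip("#").lower()].add(user_id)


def extract_hashtags(text):
    """Parse the message once and return the distinct tags it contains."""
    return {tag.lower() for tag in HASHTAG.findall(text)}


def recipients_for(text):
    """Look up followers per tag instead of testing every user's matchers."""
    recipients = set()
    for tag in extract_hashtags(text):
        # .get avoids creating empty index entries for unfollowed tags.
        recipients |= followers_by_tag.get(tag, set())
    return recipients
```

The cost per message depends on the number of tags in the message and
the number of followers of those tags, not on the total number of
users.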

> I agree that tracks are extremely powerful, but I don't see how they
> can scale. I'm definitely proposing something much less powerful, but
> I think it will scale a lot better (see below). Twitter still does not
> allow tracking on arbitrary text, so I guess they haven't been able to
> crack this nut either. I don't think tracks are highly used right now,
> but if we introduce easy conversation and hashtag following I think it
> will be highly used, so I want to make sure we do it in a way that
> will scale.

Our tracking matches not only plain text but also regular expressions,
a boolean grammar of and/or/not operators, etc. Of course this will
never scale, but let's face it: ESME will probably never reach
Twitter's scale. This doesn't mean that you can't run into performance
problems at corporate scale, but those could probably be solved with
quotas and permissions.
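
As an illustration of why such matchers resist a simple lookup, here
is a small Python sketch of composable matchers. It is not ESME's
actual filter grammar; contains, matches_regex, all_of, any_of, negate
and user_matches are hypothetical names chosen for the example:

```python
import re


def contains(word):
    """Matcher for a case-insensitive substring."""
    needle = word.lower()
    return lambda text: needle in text.lower()


def matches_regex(pattern):
    """Matcher for a regular expression found anywhere in the text."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return lambda text: compiled.search(text) is not None


def all_of(*matchers):
    """Boolean AND over matchers."""
    return lambda text: all(m(text) for m in matchers)


def any_of(*matchers):
    """Boolean OR over matchers."""
    return lambda text: any(m(text) for m in matchers)


def negate(matcher):
    """Boolean NOT of a matcher."""
    return lambda text: not matcher(text)


def user_matches(track_matchers, text):
    """True if any of a user's matchers accepts the text.

    any() stops at the first matcher that accepts; an empty list
    costs nothing beyond the emptiness check.
    """
    return any(m(text) for m in track_matchers)


# A track that wants "#testtag" but not the word "spam".
example_track = all_of(contains("#testtag"), negate(matches_regex(r"\bspam\b")))
```

Because each matcher is an arbitrary predicate, there is no key to
look up in a table; the message has to be run through the predicate
itself.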

> Just as an example of how our current track system and a more limited
> system based on hashtags or conversations compare from a performance
> perspective, let's say we have the following scenario:
>
> 1000 users on the system
> 10 users following the hashtag "#testtag"
> 1 message sent with text "Hello, I'm a #testtag"
>
> What happens in the scenario where we use a Track of the "#testtag"
> text is the following, as far as I can tell:
>
> 1. The message is sent to the Distributor
> 2. The distributor sends the message to 1000 UserActor instances
> 3. Each UserActor evaluates the message (1000 evaluations - some more
> expensive than others, depending on the # of tracks the user has set
> up)
> 4. If the message matches (for 10 users), the message is added to the timeline
>
> In this scenario we have 1000 attempts to match a message. In this
> system architecture we must be able to handle a number of matches
> per minute that is expressed by the equation NUM_MESSAGES_PER_MIN *
> NUM_USERS. If we have 10,000 users and 100 messages in a minute,
> that's 1,000,000 matches.

It's actually also multiplied by the number of matchers per user. This
might mean it's more than 1,000,000 matches *if* there is more than
one track per user on average. But in your scenario, if only 10 users
track a hashtag, then it's 1 * 10 track matching tests. For all the
other users there are no matchers in the track matchers list, so they
are simply skipped. Of course, if you mean that it's about 10,000
checks of whether the list is empty, you're right.
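
The arithmetic can be spelled out with a few lines of Python. This is
only a counting sketch under a stated assumption: it counts every
matcher as one test and ignores stopping early at the first match, so
the first number is an upper bound. count_matcher_tests is a
hypothetical helper, not ESME code:

```python
def count_matcher_tests(matchers_per_user):
    """Return (matcher_tests, empty_list_checks) for a single message.

    matchers_per_user holds one integer per user: how many track
    matchers that user has configured.
    """
    matcher_tests = sum(n for n in matchers_per_user if n > 0)
    empty_list_checks = sum(1 for n in matchers_per_user if n == 0)
    return matcher_tests, empty_list_checks


# Scenario from the thread: 1000 users, 10 of whom track one hashtag.
scenario = [1] * 10 + [0] * 990
print(count_matcher_tests(scenario))  # (10, 990)
```

So one message costs 10 real matcher tests plus 990 cheap emptiness
checks, and the expensive part grows with the number of configured
matchers rather than with the number of users.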

> In the (much more limited) scenario I'm proposing based on
> HashTagActor and ConversationActor, the following happens:
>
> 1. The message is sent to the Distributor
> 2. The Distributor parses out the tags in the message - in this case
> there is tag "#testtag"
> 3. The Distributor checks to see if there is a HashTagActor for
> #testtag (it does in this case because 10 people are following the
> #testtag hashtag) and sends the message on to that actor.
> 4. The HashTagActor for #testtag tells each UserActor to add the
> message to its mailbox.
>
> In this scenario we have 1 operation to parse out the tags in the
> message, one lookup for the HashTagActor (similar to or better in
> complexity than the tracking match that each UserActor does) for each
> tag in the message, and 10 addToMailbox calls (the same as in the
> previous example). This number does not change if there are more users
> in the system, only if there are more users following a particular
> hashtag.
>
> Similar math applies for the conversation scenario. I've simplified
> some of the message-passing-around that goes on but is the same in
> both scenarios.
>
> Note that the HashTagActor and ConversationActors would not even need
> to be resident in main memory for this to be fairly fast. They could
> be lazily constructed as needed.

I'm not even sure we need the complexity of an actor per searched
hashtag. The message is already parsed for a hashtag and statistics
are gathered for the tag cloud. We can do other optimizations like
construct a mailbox for hashes. What I'm not sure will scale is an
actor per hashtag. There might just be too many to add to all the user
actors.

> With regards to multiple messages ending up in the timeline, we
> already have code in place in the UserActor.addToMailbox method that
> will only add a message if it is not already in the Mailbox. I think
> this is what stops the current tracking mechanism from adding messages
> to the mailbox again when they already appear in my timeline.

Yes, that's what I was referring to. The problem is not implementing
this behaviour, but the fact that it is database-intensive as it
queries all Mailbox entries.
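
One way to avoid that query, sketched in Python under assumptions: if
all recipient lists for a message (followers, track matches, hashtag
followers) are merged into a set before delivery, each user receives
the message once without the Mailbox being consulted. The deliver
function and the dict-based mailboxes are hypothetical stand-ins, and
this only removes duplicates within a single distribution pass:

```python
def deliver(message_id, recipient_sources, mailboxes):
    """Add message_id to each recipient's mailbox exactly once.

    recipient_sources is an iterable of iterables of user ids, one per
    reason a user might receive the message. mailboxes maps a user id
    to that user's list of message ids (an in-memory stand-in for the
    Mailbox table).
    """
    recipients = set()
    for source in recipient_sources:
        recipients.update(source)  # set membership removes duplicates
    for user_id in recipients:
        mailboxes.setdefault(user_id, []).append(message_id)
    return recipients
```

Building the set costs time proportional to the total number of
recipient entries, which is paid in memory instead of in database
queries.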

> The fact is that we currently have the Tracks implementation and
> overhead when sending a message, and the incremental overhead
> introduced by using this existing implementation for HashTag or
> Conversation tracking is fairly minimal (similar, but I think still
> more complex than my alternative). What I'm worried about is
> implementing on top of Tracking and then finding ourselves in a
> situation where Tracking is causing us to not be able to scale but
> where we can't remove tracking because core functionality has been
> built on top of it.

Yes, this is very sound reasoning and I completely agree.

> For this reason, I guess now is a good time to have this conversation! :-)
>
> Ethan
>
> On Mon, Jul 19, 2010 at 1:04 PM, Vassil Dichev <vdichev@apache.org> wrote:
>>> The moon is pretty ominous looking in Hamburg tonight, which means
>>> that it's time for "contrary Ethan" to come out of hiding.
>>
>> Hi there, "contrary Ethan" ;-)
>>
>>> Really all I want to do is get anyone who might implement this to
>>> think twice about using tracking to do this. In fact, we probably need
>>> to take a really hard look at the tracking approach in general. Unless
>>> I'm mistaken, every message created is distributed to every user actor
>>> in the system to check if it matches a tracker for that user.
>>
>> This is correct. So let's try to take a really hard look into the
>> current implementation and the alternative.
>>
>>> Can anyone else verify that? If this is how it's really set up, then
>>> it's going to become a major problem in systems with large numbers of
>>> users and I'll need to open a Jira item for it and we'll figure out
>>> what to do. I'm not sure how to do it better right now, but if that's
>>> how it works then it's an issue.
>>
>> If it's an issue, then it seems the only way to solve this would be to
>> disable tracking completely, just like Twitter did back in the days
>> when there was tracking via IM.
>>
>> Let me also note that tracking is way more powerful than just searching
>> for a hashtag. There is a limited number of hashtags you can find in a
>> message, but in practice there are an unlimited number of matchers you
>> can construct on a message, because they use the same filters as
>> actions. This means that it's harder to have optimization shortcuts
>> for finding which existing tracking matchers match.
>>
>>> With regards to hashtag following, I think the right way to do this is
>>> to set up another actor like a User actor or (to be created)
>>> Conversation actor and when someone follows a hashtag the actor will
>>> put the message on that user's timeline.
>>>
>>> Note that these conversation and hashtag actors only need to be
>>> started up when someone follows a conversation or hashtag, and there
>>> would be only one actor object per conversation/hashtag as opposed to
>>> one per user following (as in the case of tracking). This approach is
>>> also pretty efficient because we can look at a message and know
>>> exactly which conversation or hashtag actor we need to forward it to
>>> without querying 1000s of actors to see if they are interested in it.
>>
>> Hm, I'm not sure how this would be more efficient. Instead of having
>> an actor per user which checks the track matchers per user, you'll
>> have an actor per track which checks the users for each track matcher.
>>
>> The subtle problem here is with having unique track matchers as
>> opposed to unique users. When you have the UserActor-first approach,
>> you might have duplicate matches. In other words, you might scan for
>> the same tracking match again for another user. However, when you find
>> even one matcher that matches, you stop matching and append the
>> message to the user's timeline.
>>
>> In contrast, when you go with the TrackingActor-first approach, you
>> might have duplicate user matches. This means that one message might
>> match a user multiple times for different tracking matches. Then in
>> order to ensure uniqueness, one approach is that the mailbox must be
>> scanned every time to see whether the message is already there. Another
>> approach would be to have a set of users to send a message to, in
>> which you guarantee that a user is only included once; this is
>> identical in complexity to sorting the list of users to send to.
>>
>> Another point to keep in mind: in a typical scenario, do you expect
>> there to be more users or more hashtags?
>>
>> Finally, having hashtags/conversation actors will not eliminate the
>> need for user actors, so they will be just added to the total actor
>> count.
>>
>> There is no getting around the fact that if we want tracking, each and
>> every message has to be scanned for each and every user's track
>> criteria. Whether we invert the order doesn't reduce the overall
>> complexity much. Changing the current implementation might help if
>> there is an order of magnitude more users than track searches.
>> However, since a user can have many tracked searches, it seems that at
>> least the worst-case scenario will create many more actors and
>> messages sent for the model you're suggesting.
>>
>> I'm not sure I explained this clearly enough; I can give more specific
>> examples and/or diagrams. Eventually it would be worth simulating a
>> model to see which one would scale in practice.
>>
>> Vassil
>>
>
