esme-dev mailing list archives

From Ethan Jewett <esjew...@gmail.com>
Subject Re: Article: Enterprise microblogging needs a facelift to rival email
Date Wed, 21 Jul 2010 13:55:39 GMT
Inline:

On Tue, Jul 20, 2010 at 7:41 AM, Vassil Dichev <vdichev@apache.org> wrote:
>> I pretty much agree on all counts except the performance evaluation of
>> the two options, which is what this message is about after the second
>> paragraph.
>
> Of course, there's no disagreement, because we're describing different
> things. I basically responded to your statement "we probably need to
> take a really hard look at the tracking approach in general." I also
> agree that if you're just looking for hashtags, you can optimize by
> extracting these from each message and looking them up in a hash
> table instead of iterating over all tracking matchers.

Ok, cool, I misunderstood.

I'm going to implement my suggestion with your suggested changes (see
below) in a branch and see how that goes. I've tried to think about
how to implement conversation and hashtag following via tracking, given
that we want to present it to the user as a "follow" hooked to the
conversation or tag, so that it doesn't even show up in the Tracks view.
I'll keep thinking about it, but I have a design for an approach that
runs parallel to tracks, so I'm going to try that first.

> Our tracking does not only match text, but also regular expressions,
> a boolean grammar of and/or/not operators, etc. Of course this will
> never scale, but let's face it: ESME will probably never reach
> Twitter's scale. This doesn't mean that you can't get performance
> problems at corporate scale, but they could probably be solved with
> quotas and permissions.

I was thinking that this would cause problems with as few as 1000
users. However, I have done some performance testing locally over the
last couple of days and I don't see any issues. I will keep trying
different configurations, including adding a lot more users and seeing
what happens, but if tracking doesn't cause any problem for even large
numbers of users, then I think we're OK.

> It's actually also multiplied by the number of matchers per user. This
> might mean it's more than 1,000,000 matches *if* there is more than
> one tracking per user on average. But in your scenario if only 10
> users track a hashtag, then it's 1 * 10 track matching tests. For all
> the other users there are simply no matchers in the track matchers
> list and they are simply skipped. Of course, if you mean that it's
> about 10,000 checks if the list is null, you're right.

The thing is that it will run every message through all track-matching
tests for all users. So if 1000 other users are tracking one other
hashtag each, then that's 1000 more tests, not just 10. (Your math is
exactly right for my example though.) Of course, this appears to be
just a theoretical worry for the moment since I can't seem to cause a
performance problem.
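
A minimal sketch of the counting argument above, not ESME code. It
assumes a tracker can be modeled as a plain predicate on the message
text; TrackCost, Matcher, usersTrackingOneTagEach and scanAll are
hypothetical names made up for illustration.

```scala
// Hypothetical model: counts how many matcher tests one message triggers
// when it is run through every tracker of every user.
object TrackCost {
  // Assumption: a tracker is just a predicate on the message text.
  type Matcher = String => Boolean

  // Build `n` users that each track one distinct hashtag: #tag0, #tag1, ...
  def usersTrackingOneTagEach(n: Int): Map[Int, List[Matcher]] =
    (0 until n).map { i =>
      val tag = s"#tag$i"
      i -> List[Matcher](text => text.split("\\s+").contains(tag))
    }.toMap

  // Run the message through all matchers of all users and return
  // (number of tests performed, ids of users with a matching tracker).
  def scanAll(msg: String, users: Map[Int, List[Matcher]]): (Int, Set[Int]) = {
    var tests = 0
    var hits = Set.empty[Int]
    for ((userId, matchers) <- users; m <- matchers) {
      tests += 1
      if (m(msg)) hits += userId
    }
    (tests, hits)
  }
}
```

With 1000 users tracking one tag each, a single message costs 1000
tests even though only one user is interested in it.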

> I'm not even sure we need the complexity of an actor per searched
> hashtag. The message is already parsed for a hashtag and statistics
> are gathered for the tag cloud. We can do other optimizations like
> construct a mailbox for hashes. What I'm not sure will scale is an
> actor per hashtag. There might just be too many to add to all the user
> actors.

You're exactly right. Thanks for pointing out that we don't need all
those extra actors, just one actor and then we need to persist lists
of users following a tag on the Tag class. Cool!
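
A rough sketch of what "lists of users following a tag" could look
like, using an in-memory map rather than the persisted Tag class. The
names TagFollowers, Index, follow and recipients are hypothetical, and
treating tags as case-insensitive is an assumption.

```scala
// Hypothetical in-memory stand-in for follower lists kept per tag.
object TagFollowers {
  private val hashtag = "#(\\w+)".r

  // Distinct hashtags in a message, lower-cased (assumed case-insensitive).
  def tagsOf(text: String): Set[String] =
    hashtag.findAllMatchIn(text).map(_.group(1).toLowerCase).toSet

  // Immutable index from tag name to the ids of users following that tag.
  final case class Index(followers: Map[String, Set[Long]] = Map.empty) {
    def follow(userId: Long, tag: String): Index = {
      val key = tag.toLowerCase
      val current = followers.getOrElse(key, Set.empty[Long])
      Index(followers.updated(key, current + userId))
    }

    // One map lookup per hashtag in the message, independent of how many
    // users follow other tags.
    def recipients(text: String): Set[Long] =
      tagsOf(text).flatMap(t => followers.getOrElse(t, Set.empty[Long]))
  }
}
```

A single actor could own such an index and forward each message only
to the users returned by recipients.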

>> With regards to multiple messages ending up in the timeline, we
>> already have code in place in the UserActor.addToMailbox method that
>> will only add a message if it is not already in the Mailbox. I think
>> this is what stops the current tracking mechanism from adding messages
>> to the mailbox again when they already appear in my timeline.
>
> Yes, that's what I was referring to. The problem is not implementing
> this behaviour, but the fact that it is database-intensive as it
> queries all Mailbox entries.

Hmmm, this is worrisome. So this query in the addToMailbox method isn't cached?

Mailbox.find(By(Mailbox.message, msg), By(Mailbox.user, userId)).isEmpty

Maybe it would be good enough to hit the
Mailbox.mostRecentMessagesFor(userID,20) method instead of doing this
Mailbox.find(). I think that method caches, and unless someone is
getting a really huge message volume or we have a huge lag somewhere,
the duplicate message should show up within the first few messages.
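
A small sketch of that idea, checking only a bounded window of recent
message ids instead of querying every Mailbox entry. It does not use
the real Mailbox API; RecentDedup and RecentMailbox are hypothetical
names, and the window size stands in for the 20 above.

```scala
// Hypothetical per-user window of recent message ids, newest first.
object RecentDedup {
  final case class RecentMailbox(limit: Int, ids: Vector[Long] = Vector.empty) {
    def contains(msgId: Long): Boolean = ids.contains(msgId)

    // Add the id only if it is not in the recent window; keep at most
    // `limit` ids, dropping the oldest.
    def add(msgId: Long): RecentMailbox =
      if (contains(msgId)) this
      else copy(ids = (msgId +: ids).take(limit))
  }
}
```

The trade-off is the one noted above: a duplicate that arrives after
the original has fallen out of the window is added again.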

Ethan
