incubator-esme-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ethan Jewett <esjew...@gmail.com>
Subject Re: Memory usage analysis
Date Mon, 23 Nov 2009 21:57:40 GMT
Actually, to amend this: In looking at the code carefully, I don't see
where the public timeline Comet actors actually duplicate the messages
in the queue. They appear to only store a List of message IDs.
However, it looks like it does possibly pre-render the full list of
messages? I think David or Vassil would have to comment.

On the other hand, I think the API actor does duplicate the messages,
so I'll have to look at fixing that up if it is really the case.

I'd love to learn MAT. Just a matter of finding the time!

Ethan

On Mon, Nov 23, 2009 at 2:57 PM, Markus Kohler <markus.kohler@gmail.com> wrote:
> Hi all,
> Again moving this thread to the mailing list.
>
> See my comments below.
>
> Regards,
> Markus
>
>
> "The best way to predict the future is to invent it" -- Alan Kay
>
>
> On Sun, Nov 22, 2009 at 5:39 PM, Ethan Jewett <esjewett@gmail.com> wrote:
>
>> Hi Markus,
>>
>> This is intriguing, though it's actually not that surprising. I think
>> David and Vassil might have explanations for a lot of this, but maybe
>> I can shed a little light based on how I understand ESME should work:
>>
>> On Sat, Nov 21, 2009 at 5:11 PM, Markus Kohler <markus.kohler@gmail.com>
>> wrote:
>>
>> > This means that "Europe/Berlin" appears 11729 times in memory at the
>> point
>> > in time the heap dump was taken. If those Strings could be
>> removed 750.656
>> > or more bytes could be reclaimed.  This copies seems to come
>> > from net.liftweb.http.RenderOut btw.
>> > For shallow versus retained size/heap
>> > check
>> http://kohlerm.blogspot.com/2009/02/how-to-really-measure-memory-usage-of.html
>> .
>> > If you look further below in the table you can see that it seems that
>> each
>> > message is there 323 times, which is equal to the number of users.
>> > Note that I manually created only 2 messages "Test123" and "Test1234",
>> the
>> > others are just status updates.
>> > This means that with n users potentially producing n times more messages
>> > than a single user, that will not only need n times more memory, but n
>> times
>> > n e.g. O(n^2).
>>
>> I think that theoretically this isn't quite right. A lot of the
>> duplicate messages you are seeing are in the public timeline queues
>> that are created when a user is logged into the website (or an API
>> session). Messages do duplicate in these queues so that the exact set
>> of messages the user needs is available for download. Once the user
>> downloads them to the web interface or via the API delta mechanism,
>> the queue should be cleared out.
>
>
> Ok, I think I understand what you mean, but shouldn't we then see less
> duplicates for the last user because this user should have got all the
> updates.
> And even then i think the message text is exactly the same there is
> not personalization involved and therefore it's not necessary to store the
> duplicates of the Strings. If it's needed that the messages are copied of
> each user one could still share the String for equal messages. The messages
> are immutable so there shouldn't be a problem with this approach, I think.
>
>
>> (Would be excellent to test this.) So
>> worst case, you have O(n*m) memory usage where n is the number of
>> created messages and m is the number of *logged in* users.
>>
>> That's still pretty terrible, but it's not out of hand. I think this
>> points us in the direction of getting rid of the public timeline
>> altogether, since it is most of the problem. Without the public
>> timeline, if every user is following 10 other users, then you only get
>> O(10*m) where m is the number of users logged in.
>>
>>
> I agree that the public timeline is a major issue at least for larger number
> of users. I think I already suggested to remove it. But I think one major
> issue with scaling twitter like services, is that there might be burst of
> messages after certain poplar events and that the nr. of followers probably
> follows the power law (http://en.wikipedia.org/wiki/Power_law) This means
> there are a few users with an very high follower rate which could cause
> message storms and essentialy cause a lot of messages to be copied.
>
>
>> It's possible that this is not the behavior you're seeing, but if it's
>> not, then I think this is a bug.
>>
>>
> I think the easiest way to analyze this  is I to show you how to use MAT and
> then you can browser the the object graph.
>
>
>> > This will kill us as the number of users goes up. I understand that in
>> Scala
>> > a lot of objects can be immutable, but I still think it doesn't make
>> sense
>> > to hold copies of those Strings.
>>
>> It does't make sense in general, but I think it does make sense when
>> you are queuing up messages that you know users are going to request.
>> Making the queues immutable and, more importantly, self-contained
>> results in a situation in which a single user's request for new
>> messages is extremely cheap. All the work has already been done. The
>> inability to split the work to generate the message queue from the
>> actual request for the queue was what killed Twitter and what David (I
>> understand) had a large hand in fixing. That experience informs a lot
>> of the work done on ESME. I agree on reducing memory footprint, but
>> there are significant advantages to having this type of duplication
>> (though keeping the size of the duplication as small as possible would
>> be excellent).
>>
>>
> Immutable is only a relative I think. For example Strings in Java are
> immutable, but that doesn't mean that if you do an operation on String that
> "changes" it that then the whole object graph of the String is copied. In
> fact Strings point to char[] and those char[] can be shared (people are
> often surprised), depending on what operations you call.
>
> My idea is that the same could be done with the messages.
>
>
>
>> > There are long chains of scala.$colon$colon everywhere and this class
>> seems
>> > to cause quite some overhead ( linked lists I guess)
>> > Check histogram.zip. The shallow size is already 10% of the heap.
>> > The dominator tree
>> > (http://kohlerm.blogspot.com/2009/02/memory-leaks-are-easy-to-find.html
>> )  grouped
>> > by class (dominator_tree.zip) shows that 63% of the memory is spend in
>> the
>> > the org.apache.esme.comet.PublicTimeline. Haven't looked into the
>> details,
>> > but I guess that overhead is caused by  those String duplicates.
>>
>> I'm not surprised by this. As mentioned above, the public timeline is
>> something like size O(n*m) in memory while the non-public timelines
>> are more like O(10*m) in total, for some value of 10. (Ok, I should be
>> using 1, but my non-computer-science background is showing :-)
>>
>
>
> So Linked lists are used for Queues?
> If so, is it "only" because they are easy to use in Scala? I found that in
> most cases queues can be implemented with Array like structures more
> efficiently. You just maintain a start end end "pointer" into the Array.
> Check
> http://www.java-tips.org/java-se-tips/java.lang/array-based-queue-implementation-in-java.html
> to
> get the idea (not sure whether this impl is any good).
>
> Arrays are just more compact and cache friendly.
>
>
>>
>> I think it would be very educational to take a look at what happens
>> when you have fewer users logged in, when there are more messages, and
>> ideally with a modified version of ESME that doesn't provide the
>> public timeline (possibly based on api2 calls?).
>>
>>
> Sure. My plan would be to extend the script, such that users would also
> follow a few other users and also post messages.
> Ideally at some point the followers number should follower the power law I
> guess.
>
> But so far there are enough problems. For example posting immediately after
> all users where logged on is very slow.
>
>
>
> Regards,
> Markus
>
>
>> Hopefully that's a bit helpful!
>>
>
> Yes, very helpful indeed!
>
>>
>> Ethan
>>
>

Mime
View raw message