incubator-esme-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Markus Kohler <markus.koh...@gmail.com>
Subject Re: Memory usage analysis
Date Mon, 23 Nov 2009 20:57:44 GMT
Hi all,
Again moving this thread to the mailing list.

See my comments below.

Regards,
Markus


"The best way to predict the future is to invent it" -- Alan Kay


On Sun, Nov 22, 2009 at 5:39 PM, Ethan Jewett <esjewett@gmail.com> wrote:

> Hi Markus,
>
> This is intriguing, though it's actually not that surprising. I think
> David and Vassil might have explanations for a lot of this, but maybe
> I can shed a little light based on how I understand ESME should work:
>
> On Sat, Nov 21, 2009 at 5:11 PM, Markus Kohler <markus.kohler@gmail.com>
> wrote:
>
> > This means that "Europe/Berlin" appears 11729 times in memory at the
> point
> > in time the heap dump was taken. If those Strings could be
> removed 750.656
> > or more bytes could be reclaimed.  This copies seems to come
> > from net.liftweb.http.RenderOut btw.
> > For shallow versus retained size/heap
> > check
> http://kohlerm.blogspot.com/2009/02/how-to-really-measure-memory-usage-of.html
> .
> > If you look further below in the table you can see that it seems that
> each
> > message is there 323 times, which is equal to the number of users.
> > Note that I manually created only 2 messages "Test123" and "Test1234",
> the
> > others are just status updates.
> > This means that with n users potentially producing n times more messages
> > than a single user, that will not only need n times more memory, but n
> times
> > n e.g. O(n^2).
>
> I think that theoretically this isn't quite right. A lot of the
> duplicate messages you are seeing are in the public timeline queues
> that are created when a user is logged into the website (or an API
> session). Messages do duplicate in these queues so that the exact set
> of messages the user needs is available for download. Once the user
> downloads them to the web interface or via the API delta mechanism,
> the queue should be cleared out.


Ok, I think I understand what you mean, but shouldn't we then see less
duplicates for the last user because this user should have got all the
updates.
And even then i think the message text is exactly the same there is
not personalization involved and therefore it's not necessary to store the
duplicates of the Strings. If it's needed that the messages are copied of
each user one could still share the String for equal messages. The messages
are immutable so there shouldn't be a problem with this approach, I think.


> (Would be excellent to test this.) So
> worst case, you have O(n*m) memory usage where n is the number of
> created messages and m is the number of *logged in* users.
>
> That's still pretty terrible, but it's not out of hand. I think this
> points us in the direction of getting rid of the public timeline
> altogether, since it is most of the problem. Without the public
> timeline, if every user is following 10 other users, then you only get
> O(10*m) where m is the number of users logged in.
>
>
I agree that the public timeline is a major issue at least for larger number
of users. I think I already suggested to remove it. But I think one major
issue with scaling twitter like services, is that there might be burst of
messages after certain poplar events and that the nr. of followers probably
follows the power law (http://en.wikipedia.org/wiki/Power_law) This means
there are a few users with an very high follower rate which could cause
message storms and essentialy cause a lot of messages to be copied.


> It's possible that this is not the behavior you're seeing, but if it's
> not, then I think this is a bug.
>
>
I think the easiest way to analyze this  is I to show you how to use MAT and
then you can browser the the object graph.


> > This will kill us as the number of users goes up. I understand that in
> Scala
> > a lot of objects can be immutable, but I still think it doesn't make
> sense
> > to hold copies of those Strings.
>
> It does't make sense in general, but I think it does make sense when
> you are queuing up messages that you know users are going to request.
> Making the queues immutable and, more importantly, self-contained
> results in a situation in which a single user's request for new
> messages is extremely cheap. All the work has already been done. The
> inability to split the work to generate the message queue from the
> actual request for the queue was what killed Twitter and what David (I
> understand) had a large hand in fixing. That experience informs a lot
> of the work done on ESME. I agree on reducing memory footprint, but
> there are significant advantages to having this type of duplication
> (though keeping the size of the duplication as small as possible would
> be excellent).
>
>
Immutable is only a relative I think. For example Strings in Java are
immutable, but that doesn't mean that if you do an operation on String that
"changes" it that then the whole object graph of the String is copied. In
fact Strings point to char[] and those char[] can be shared (people are
often surprised), depending on what operations you call.

My idea is that the same could be done with the messages.



> > There are long chains of scala.$colon$colon everywhere and this class
> seems
> > to cause quite some overhead ( linked lists I guess)
> > Check histogram.zip. The shallow size is already 10% of the heap.
> > The dominator tree
> > (http://kohlerm.blogspot.com/2009/02/memory-leaks-are-easy-to-find.html
> )  grouped
> > by class (dominator_tree.zip) shows that 63% of the memory is spend in
> the
> > the org.apache.esme.comet.PublicTimeline. Haven't looked into the
> details,
> > but I guess that overhead is caused by  those String duplicates.
>
> I'm not surprised by this. As mentioned above, the public timeline is
> something like size O(n*m) in memory while the non-public timelines
> are more like O(10*m) in total, for some value of 10. (Ok, I should be
> using 1, but my non-computer-science background is showing :-)
>


So Linked lists are used for Queues?
If so, is it "only" because they are easy to use in Scala? I found that in
most cases queues can be implemented with Array like structures more
efficiently. You just maintain a start end end "pointer" into the Array.
Check
http://www.java-tips.org/java-se-tips/java.lang/array-based-queue-implementation-in-java.html
to
get the idea (not sure whether this impl is any good).

Arrays are just more compact and cache friendly.


>
> I think it would be very educational to take a look at what happens
> when you have fewer users logged in, when there are more messages, and
> ideally with a modified version of ESME that doesn't provide the
> public timeline (possibly based on api2 calls?).
>
>
Sure. My plan would be to extend the script, such that users would also
follow a few other users and also post messages.
Ideally at some point the followers number should follower the power law I
guess.

But so far there are enough problems. For example posting immediately after
all users where logged on is very slow.



Regards,
Markus


> Hopefully that's a bit helpful!
>

Yes, very helpful indeed!

>
> Ethan
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message