harmony-dev mailing list archives

From "Senaka Fernando" <senaka...@gmail.com>
Subject Parrot's discussion on GC Improvements
Date Sun, 13 Apr 2008 12:15:05 GMT
FYI, Parrot's discussion on GC Improvements

---------- Forwarded message ----------
From: chromatic <chromatic@wgz.org>
Date: Sat, Apr 12, 2008 at 5:57 AM
Subject: The Big Three Rakudo (and Parrot OO) Bottlenecks
To: parrot-porters@perl.org

I've committed a couple of minor optimizations which speed up Rakudo and
OO in general by about 35%.  There may be a few more lurking, but I keep
running into three spots which dominate most of the other optimization
strategies I might pursue.

1) The Garbage Collector is algorithmically inefficient.  There are various
potential optimization strategies here which don't require us to walk every
allocated header in the system multiple times (yes, it's that bad in at least
two places), but the real solution, I suspect, is to replace the current GC
with at least the one specified in the new GC PDD.

The two worst spots are turning off the live flag (and destroying everything
with a custom destroy flag) and dividing newly-allocated memory pools into
headers and manually adding each header to a linked-list of free objects.

Flag marking would be much cheaper (and likely cache-friendlier, though I
haven't done the math to be sure of this) if we used some sort of
card-marking scheme, where we can scan 8, 16, 32, or 64 bits in a bitmap
at a time to see if we need to do anything specific for objects, rather
than walking through all of the memory pools in the system and checking
flags one header at a time.  It might also make our PObjs a couple of bytes
smaller, which puts that much less pressure on the GC and means we have to
run a full GC that much less frequently.  I really recommend that the new GC
use this approach.
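To make the bitmap idea concrete, here is a minimal sketch (not Parrot code;
all names are illustrative) of scanning one machine word of mark bits at a
time, so that 64 unmarked headers can be skipped with a single zero test:

```c
#include <assert.h>
#include <stdint.h>

/* Card-marking sketch: one mark bit per object header instead of a flag
 * inside each PObj.  A whole word of zeros means 64 objects need no work. */

#define OBJECTS_PER_WORD 64

/* Return the index of the next marked object at or after `start`, or -1. */
static int next_marked(const uint64_t *bitmap, int nwords, int start) {
    for (int w = start / OBJECTS_PER_WORD; w < nwords; w++) {
        uint64_t word = bitmap[w];
        if (w == start / OBJECTS_PER_WORD)
            word &= ~0ULL << (start % OBJECTS_PER_WORD); /* mask earlier bits */
        if (word != 0)  /* ctz finds the lowest set bit in one instruction */
            return w * OBJECTS_PER_WORD + __builtin_ctzll(word);
    }
    return -1;
}
```

The win over per-header flag checks is that the common all-dead or all-clean
case costs one compare per 64 objects rather than 64 loads.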

The free object list is the reason that compacting/copying collectors are
popular: all you have to do to find the next free object is bump the pointer
by sizeof (header) and check that it's still within the bounds of the pool.
We could ameliorate that somewhat by using a hybrid scheme of pointer
bumping followed by free-list checking, which would remove the cost of
dividing fresh pools into headers and building the linked list, at the cost
of a little more complexity in the allocator.  I'm not convinced this is an
appropriate tradeoff for the 10% improvement it's likely to give.  A copying
or compacting scheme would give us this benefit almost for free (where the
only cost is replacing the entire garbage collector and rewriting a fair
chunk of the internals of the system).
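A sketch of that hybrid scheme, under the assumption of fixed-size headers
(the names here are illustrative, not Parrot's): bump-allocate from the
never-used tail of the pool first, so no up-front linked list is built, and
only fall back to a free list of headers the collector has recycled:

```c
#include <assert.h>
#include <stddef.h>

/* Hybrid bump/free-list allocator sketch. */

typedef struct FreeNode { struct FreeNode *next; } FreeNode;

typedef struct {
    char     *bump;      /* next never-used header in the pool  */
    char     *end;       /* one past the pool's last header     */
    size_t    obj_size;  /* header size, >= sizeof(FreeNode)    */
    FreeNode *free_list; /* headers returned by the collector   */
} Pool;

static void *pool_alloc(Pool *p) {
    if (p->bump + p->obj_size <= p->end) {   /* fast path: pointer bump */
        void *obj = p->bump;
        p->bump += p->obj_size;
        return obj;
    }
    if (p->free_list) {                      /* slow path: recycled header */
        FreeNode *n  = p->free_list;
        p->free_list = n->next;
        return n;
    }
    return NULL;  /* caller triggers a GC run or allocates a new pool */
}

static void pool_free(Pool *p, void *obj) {  /* collector recycles a header */
    FreeNode *n  = obj;
    n->next      = p->free_list;
    p->free_list = n;
}
```

The extra complexity the mail mentions is exactly the two-branch allocation
path; fresh pools cost nothing to prepare beyond setting `bump` and `end`.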

With all that said, one good optimization is to allocate fewer PObjs, so if
there's a hotspot in either PIR or C where you create a new PMC or STRING
you don't always need (or could cache), removing that allocation is almost
always a minor optimization.  (It's almost always worth doing only in
hotspots, though, so we have to be able to identify those in PIR... soon.)

2) PIR vtable overrides are very naive.  See src/oo.c, specifically
Parrot_oo_find_vtable_override.  Currently there are very few PIR-level
overrides of vtable entries, but the "Is there an override?" scheme here
walks through all of the parents of a Class every time something wants to
check if there's a vtable override -- and both the Class and Object PMCs
check for overrides, namely get_class and [gs]et_attr_str.  I made a quick
and dirty optimization to this function to check only the current class and
it gave me a 12.5% speed improvement over the full check, running my
NQP/Rakudo benchmark.  'make coretest' all passed as well.

We have a dynamic system such that we can add and remove parents as well as
methods and vtable entries at runtime, but we should still be able to cache
this information somehow such that we don't continually look for things that
don't exist, probably won't exist, and certainly haven't sprung into
existence since the most recent check.

I don't have a complete answer for this yet, but we should be able to have
smarter caching without removing the essential dynamism from the system.
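One way to keep the dynamism is a per-class cache that remembers both hits
and misses, and is flushed whenever the hierarchy changes.  A minimal sketch
(every name here is illustrative, not Parrot's actual API; the parent walk
is stubbed out):

```c
#include <assert.h>
#include <string.h>

/* Per-class vtable-override cache sketch: cache "no override" answers too,
 * so repeated lookups for absent overrides stop walking all parents. */

#define N_VTABLE_SLOTS 8

enum { CACHE_UNKNOWN = 0, CACHE_MISS, CACHE_HIT };

typedef struct {
    unsigned char state[N_VTABLE_SLOTS];     /* per-slot cached answer  */
    void         *override[N_VTABLE_SLOTS];  /* cached override, if any */
} OverrideCache;

/* Stand-in for the expensive walk over all parents (the real work that
 * Parrot_oo_find_vtable_override does today). */
static void *slow_lookup(void *const *table, int slot) { return table[slot]; }

static void *find_override(OverrideCache *c, void *const *table, int slot,
                           int *walks) {
    if (c->state[slot] == CACHE_UNKNOWN) {
        c->override[slot] = slow_lookup(table, slot);
        c->state[slot]    = c->override[slot] ? CACHE_HIT : CACHE_MISS;
        (*walks)++;                          /* count full hierarchy walks */
    }
    return c->override[slot];
}

/* Called whenever a parent, method, or vtable entry is added or removed:
 * this is what preserves runtime mutability. */
static void cache_invalidate(OverrideCache *c) {
    memset(c->state, CACHE_UNKNOWN, sizeof c->state);
}
```

Caching negative results is the important part for the case in the mail,
where overrides mostly don't exist and haven't appeared since the last check.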

3) PCC argument processing is slow.  I've looked over Parrot_pass_args and
Parrot_process_args several times in the past few months, and I didn't see
any obvious speedups or tricks.  However, we do spend a lot of time shuffling
data back and forth, and something (instinct, blind desire, lower blood
pressure than normal) suggests that we already *know* some of this
information.  That is, at the point of a call we know if there are slurpy or
named arguments and if the invoked function wants to bind slurpy or named
parameters.  It seems like we could have a fast path for that common
operation... somehow.
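One shape that fast path could take, sketched with invented flag and struct
names (this is not Parrot's PCC internals): check the caller's and callee's
signature flags once, and if neither side uses slurpy or named arguments,
copy the positionals directly instead of running the general binder:

```c
#include <assert.h>

/* Fast-path argument-binding sketch.  ARGS_* flags and ArgList are
 * illustrative stand-ins for call-signature metadata. */

enum { ARGS_SLURPY = 1, ARGS_NAMED = 2 };

typedef struct { int flags; int count; int vals[8]; } ArgList;

/* Stand-in for the full general binder (Parrot_process_args machinery). */
static int bind_general(const ArgList *src, ArgList *dst) {
    for (int i = 0; i < src->count; i++) dst->vals[i] = src->vals[i];
    dst->count = src->count;
    return 1;                            /* 1 = took the general path */
}

static int pass_args(const ArgList *src, ArgList *dst) {
    if (((src->flags | dst->flags) & (ARGS_SLURPY | ARGS_NAMED)) == 0) {
        /* fast path: plain positionals on both sides, straight copy */
        for (int i = 0; i < src->count; i++) dst->vals[i] = src->vals[i];
        dst->count = src->count;
        return 0;                        /* 0 = took the fast path */
    }
    return bind_general(src, dst);       /* anything fancy: general binder */
}
```

The point of the sketch is that both signatures are known at the call site,
so the dispatch is a single flag test rather than per-argument inspection.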

Suggestions, analysis, and patches are all very much welcome.

-- c
