harmony-dev mailing list archives

From Apache Harmony Bootstrap JVM <boot...@earthlink.net>
Subject Re: Some questions about the architecture
Date Fri, 21 Oct 2005 04:14:21 GMT


-----Original Message-----
From: Robin Garner <robin.garner@anu.edu.au>
Sent: Oct 20, 2005 3:08 PM
To: Apache Harmony Bootstrap JVM <bootjvm@earthlink.net>
Cc: harmony-dev@incubator.apache.org
Subject: Re: Some questions about the architecture

> Robin, Rodrigo,
>
> Perhaps the two of you could get your heads together
> on GC issues?  I think both of you have been thinking
> along related lines on the structure of GC for this JVM.
> What do you think?

I think the current challenge is to get the GC people and the VM people
thinking along the same lines when it comes to GC issues.  I think we're
both coming from the same place.

---

Probably!

---

> Further comments follow...
>
> -----Original Message-----
> From: Rodrigo Kumpera <kumpera@gmail.com>
> Sent: Oct 19, 2005 4:49 PM
> To: harmony-dev@incubator.apache.org
> Subject: Re: Some questions about the architecture
>
> On 10/19/05, Apache Harmony Bootstrap JVM <bootjvm@earthlink.net> wrote:
>>
>>
>> -----Original Message-----
>> From: Rodrigo Kumpera <kumpera@gmail.com>
>> Sent: Oct 19, 2005 1:49 PM
>> To: harmony-dev@incubator.apache.org, Apache Harmony Bootstrap JVM
>> <bootjvm@earthlink.net>
>> Subject: Re: Some questions about the architecture
>>
>> On 10/19/05, Apache Harmony Bootstrap JVM <bootjvm@earthlink.net> wrote:
>> >
> ...snip...
>>
>> Notice that in 'jvm/src/jvmcfg.h' there is a JVMCFG_GC_THREAD
>> that is used in jvm_run() as a regular thread like any other.
>> It calls gc_run() on a scheduled basis.  Also, any time an object
>> finalize() is done, gc_run() is possible.  Yes, I treat GC as a
>> stop-the-world process, but here is the key:  Due to the lack
>> of asynchronous native POSIX threads, there are no safe points
>> required.  The only thread is the SIGALRM target that sets the
>> volatile boolean in timeslice_tick() for use by opcode_run() to
>> test.  <b>This is the _only_ formally asynchronous data structure in
>> the whole machine.</b>  (Bold if you use an HTML browser, otherwise
>> clutter meant for emphasis.)  Objects that contain no references can
>> be GC'd since they are merely table entries.  Depending on how the
>> GC algorithm is done, gc_run() may or may not even need to look
>> at a particular object.
>>
>> Notice also that classes are treated in the same way by the GC API.
>> If a class is no longer referenced by any objects, it may be GC'd also.
>> First, its intrinsic class object must be GC'd, then the class itself.
>> This may take more than one pass of gc_run() to make it happen.

There's a major misconception here.  As I was describing it to someone a
while ago, conceptually a garbage collected heap is actually simpler than
an explicitly managed heap.  The standard heap has 'malloc' and 'free'.  A
managed heap (with GC) just has 'malloc'.

In practice it's more complex but the principle is the same.  From the
interpreter's point of view, you just allocate.  Forever.  Reclaiming free
space is the GC's problem, because it's the only part of the VM that can
know when something is dead.  Things die when (or soon after) all
references to them die.

---

This design uses "heap.h" and friends _only_ for management of
internal JVM data structures.  That heap is _never_ repeat _never_
available, visible, or controllable, directly or indirectly, through the
effects of Java bytecodes.  The exceptions are the functions
object_instance_new() and object_instance_delete(), and then only for
array objects and for the arrays of 'jvalue' that hold the fields in a
class (one for static fields, the other for instance fields).  These go to
the heap for their data storage only:  a 'jlong' array of 10 elements gets
10 x 8, or 80 bytes, with the rest coming out of the object table; an
object with, say, 5 fields gets 5 x 8, or 40 bytes.
I have separated GC out as a _completely_ different issue that relates
_only_ to the effects of Java bytecodes.
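
The storage arithmetic above can be checked with a small sketch.  The
union and function names here are hypothetical and are not taken from the
bootstrap JVM sources; the union only assumes that a field slot is 8 bytes
wide, as the 5 x 8 figure implies, and that a reference is stored as an
object table index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte field slot, loosely modelled on a 'jvalue' union. */
typedef union {
    int8_t  b;
    int16_t s;
    int32_t i;
    int64_t j;    /* jlong-sized member fixes the slot width at 8 bytes */
    float   f;
    double  d;
    int32_t ref;  /* assumption: a reference is an object table index */
} sketch_jvalue;

/* Heap bytes needed for the element data of an array.  The array's
 * bookkeeping is assumed to live elsewhere (the object table). */
size_t sketch_array_data_bytes(size_t count, size_t elem_size)
{
    assert(elem_size > 0);
    return count * elem_size;
}

/* Heap bytes needed for a block of field slots, one slot per field. */
size_t sketch_field_data_bytes(size_t nfields)
{
    return nfields * sizeof(sketch_jvalue);
}
```

With 8-byte slots, sketch_array_data_bytes(10, sizeof(int64_t)) gives 80
and sketch_field_data_bytes(5) gives 40, matching the figures quoted.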

In other words, I have two separate memory management domains
so that, no matter what sort of GC is used by the virtual machine,
the real-machine code that implements it is not affected by it at all.
In this way, GC can have neither a positive nor a negative effect
on the real-machine implementation of the JVM.  This was by design,
and whether or not this was a good design choice, I think that is for
more experienced JVM architects than myself to decide.

By virtue of having GC control when object references are reclaimed,
the array and field storage ultimately falls under its control instead of
being explicitly managed:  object_instance_new() is driven by 'new' and
object_instance_delete() is driven by GC.

*** WHAT SAY YOU JVM EXPERTS LURKING OUT THERE?  I KNOW
      YOU'RE READING THIS!  Please speak up!  I would like to hear
      what your experience has been so we can create the best
      solution to the issue of real and virtual machine memory management.  ***

Now, to be fair and give complete disclosure at this time, my object
allocation is from the static 'pjvm->object[]' array of 'robject', which has a
fixed maximum size.  The same goes for classes with the 'pjvm->class[]' array
of 'rclass'.  The OBJECT() and CLASS() macros can be adjusted to reflect
any different allocation mechanism that might be chosen for any implementation,
either now or in the future, hopefully making this JVM _extremely_ modular.  (See
also 'README' for a section on "Subsystem component abstraction".)
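
A minimal sketch of a fixed-size object table reached only through an
accessor macro might look like the following.  Only the names 'pjvm',
'object[]' and OBJECT() come from the description above; the struct
layouts, the table size, and the sketch_* functions are assumptions made
for illustration and are not the actual bootstrap JVM code.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_OBJECTS 8   /* hypothetical fixed table size */

/* Hypothetical object table entry. */
typedef struct {
    int in_use;        /* nonzero while the slot holds a live object */
    int class_index;   /* assumed: index into a class table */
} sketch_robject;

typedef struct {
    sketch_robject object[SKETCH_MAX_OBJECTS];
} sketch_jvm;

static sketch_jvm sketch_jvm_storage;          /* zero-initialized */
static sketch_jvm *pjvm = &sketch_jvm_storage;

/* Every access goes through the macro, so a different allocation scheme
 * only requires redefining it. */
#define OBJECT(i) (pjvm->object[(i)])

/* Claim the first free slot; returns its index, or -1 if the table is full. */
int sketch_object_new(int class_index)
{
    for (int i = 0; i < SKETCH_MAX_OBJECTS; i++) {
        if (!OBJECT(i).in_use) {
            OBJECT(i).in_use = 1;
            OBJECT(i).class_index = class_index;
            return i;
        }
    }
    return -1;
}

/* Release a slot so a later allocation can reuse it. */
void sketch_object_delete(int i)
{
    assert(i >= 0 && i < SKETCH_MAX_OBJECTS);
    OBJECT(i).in_use = 0;
}
```

Because callers never index the array directly, the table could later be
moved to a different backing store without touching them.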

Keep in mind that this JVM was _not_ designed with blinding speed in mind for
its first cut, but with the Henry Ford approach:

    1.  Sweat blood and create a Model "A".
    2.  Sell enough to make it worth the while.
    3.  Work on improvements and create a Model "B".
    4.  Go from one failure to the next with no loss of enthusiasm (Quote from Mr. Ford)
    5.  Get down to the Model "K", which had some significant success.
    6.  Keep working until you build the Model "T", which sold by the million.

I guess I'd like us to get the Model "A" out the door even as we look toward
improvements such as are being suggested from a number of folks.  If we
need to adjust the heap and GC models, sure, we can do it.  And perhaps a
new and better GC interface would be appropriate (As Robin pointed out to me
off the list).  As he also pointed out, now is the time to make an API change
like this before we get deeply into the project as a group.

I would like to see what this JVM has going for it with its design in its
Model "A" state, whether or not we adjust the GC interface paradigm.  Part of
the reason I didn't supply GC is that (1) it is a crucial element, (2) I've
never done one, and (3) there are people like Robin who have written honours
theses on GC and are therefore much more qualified.

---


GC is triggered in two cases: 1) the user code calls System.gc().  2) the
heap fills up (for some suitable definition of 'fills up').  There is
never any need for the VM code to call the garbage collector.

A consequence is that every call to 'new' needs to be a gc safe point.  If
the heap is full, there's no way to keep executing until a timer event
triggers.
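
The point about 'new' being a GC safe point can be illustrated with a toy
bump allocator that runs the collector when a request does not fit.  All
names here are hypothetical, and the "collector" is only a stand-in that
treats every object as dead; a real one would trace from the roots.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_HEAP_BYTES 64   /* tiny heap so that it fills up quickly */

static unsigned char sketch_heap[SKETCH_HEAP_BYTES];
static size_t sketch_heap_top;   /* bump pointer: next free byte */
static int sketch_gc_runs;       /* how many collections have happened */

/* Stand-in collector: pretends everything is dead and resets the heap. */
static void sketch_gc_collect(void)
{
    sketch_heap_top = 0;
    sketch_gc_runs++;
}

/* Allocation is the safe point: when the request does not fit, the
 * collector runs right here, before the allocation is retried.
 * Alignment is ignored to keep the sketch short. */
void *sketch_new(size_t nbytes)
{
    if (nbytes > SKETCH_HEAP_BYTES) {
        return NULL;             /* can never fit, even after a collection */
    }
    if (sketch_heap_top + nbytes > SKETCH_HEAP_BYTES) {
        sketch_gc_collect();
    }
    void *p = &sketch_heap[sketch_heap_top];
    sketch_heap_top += nbytes;
    assert(sketch_heap_top <= SKETCH_HEAP_BYTES);
    return p;
}
```

The interpreter side only ever calls sketch_new(); it never frees and
never invokes the collector itself.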

What the VM needs to do is to provide services that allow the GC to do its
job.  These are at core:
- A way to allocate bulk memory (e.g. mmap)
- A way to enumerate roots (this is where stack scanning happens)
- A scheduling mechanism (especially for parallel GC)
- A way to enumerate the pointers in an object
- Notification (which the GC can ignore) for pointer read and write
operations (read and write barriers)
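
As a rough illustration of the list above, the services could be grouped
into a table of function pointers that the VM hands to the GC.  This is
not the interface proposed for Harmony; every name below is hypothetical,
and the toy root set exists only so that the example can be exercised.

```c
#include <assert.h>
#include <stddef.h>

/* Callback the GC supplies to visit one pointer slot. */
typedef void (*sketch_visit_fn)(void **slot, void *ctx);

/* Hypothetical table of services the VM provides to the GC. */
typedef struct {
    void *(*alloc_bulk)(size_t nbytes);                    /* e.g. mmap-backed */
    void  (*enumerate_roots)(sketch_visit_fn visit, void *ctx);
    void  (*enumerate_pointers)(void *obj, sketch_visit_fn visit, void *ctx);
    void  (*schedule)(void (*work)(void *), void *arg);    /* run GC work */
    void  (*write_barrier)(void **slot, void *new_value);  /* GC may ignore */
    void  (*read_barrier)(void **slot);                    /* GC may ignore */
} sketch_vm_services;

/* Toy VM side: three root slots standing in for stacks and statics. */
static void *sketch_roots[3];

static void sketch_toy_enumerate_roots(sketch_visit_fn visit, void *ctx)
{
    for (size_t i = 0; i < 3; i++) {
        visit(&sketch_roots[i], ctx);
    }
}

/* Toy GC side: count the roots that currently hold a reference. */
static void sketch_count_visit(void **slot, void *ctx)
{
    if (*slot != NULL) {
        (*(size_t *)ctx)++;
    }
}

size_t sketch_count_live_roots(const sketch_vm_services *vm)
{
    assert(vm != NULL && vm->enumerate_roots != NULL);
    size_t n = 0;
    vm->enumerate_roots(sketch_count_visit, &n);
    return n;
}
```

The GC only sees the table, so the VM is free to decide how roots are
found (stack scanning, handle tables, and so on).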

Understanding this will go a long way to getting past the disconnect we
currently have over GC issues.  When I propose the new gc interfaces, this
should become more concrete.

> That depends on the GC implementation.  Look at 'jvm/src/gc_stub.c'
> for the stub reference implementation.  To see the mechanics of
> how to fit it into the compile environment, look at the GC and heap
> setup in 'config.sh' and at 'jvm/src/heap.h' for how multiple heap
> implementations get configured in.

As mentioned before, the heap *is* the GC.

---

I think we are using the terms "heap" and "GC" with slightly different
definitions.  My definitions are stated above, where I think you are using
the terms synonymously.

Also, I have GC set up to meet the two conditions you state.  But a 'new'
event never needs a GC safe point in this implementation because of the
outer/inner loop implementation on the _same_ real-machine thread, as
described in other posts to this list.
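
For readers who have not seen those posts, the outer/inner loop idea can
be sketched roughly as follows.  This is an illustration under
assumptions, not the bootstrap JVM code:  the names are hypothetical, an
"opcode" is reduced to incrementing a program counter, and the tick
handler is shown without being installed for SIGALRM.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* The one asynchronously written datum: set by the tick handler and
 * polled by the inner loop. */
static volatile sig_atomic_t sketch_slice_expired;

/* Would be installed as the SIGALRM handler (e.g. with sigaction). */
void sketch_timeslice_tick(int signo)
{
    (void)signo;
    sketch_slice_expired = 1;
}

/* Inner loop: run opcodes for one Java thread until its time slice
 * expires or its program ends.  Returns the number of opcodes run. */
int sketch_inner_loop(int *pc, int program_len)
{
    assert(pc != NULL);
    int executed = 0;
    while (*pc < program_len && !sketch_slice_expired) {
        (*pc)++;                 /* stand-in for executing one opcode */
        executed++;
    }
    return executed;
}

/* One outer-loop pass over all Java threads.  The GC hook runs between
 * slices on the same real-machine thread, so no opcode is in flight when
 * it is called.  Returns the total number of opcodes run. */
int sketch_outer_pass(int *pcs, size_t nthreads, int program_len,
                      void (*gc_hook)(void))
{
    int total = 0;
    for (size_t t = 0; t < nthreads; t++) {
        sketch_slice_expired = 0;            /* start a fresh slice */
        total += sketch_inner_loop(&pcs[t], program_len);
        if (gc_hook != NULL) {
            gc_hook();
        }
    }
    return total;
}
```

Because the collector only ever runs from the outer loop, it always sees
every Java thread stopped at an opcode boundary.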

---

> The GC interface API that I defined may or may not be adequate
> for everything.  I basically set it up so that any time an object
> reference was added or deleted, I called a GC function.

So is this a write barrier?  I.e., are these functions called for every
PUTFIELD, PUTSTATIC and AASTORE bytecode?

---

No.  There are no barriers of any kind except the mutex mechanism for the
time slice thread.  The outer/inner loop interpreter structure precludes the
need for them.

Notice that the implementation will determine whether this is an efficient way
to do it or not, especially since I distinguish between fields and local variables.

---

> The same goes for class loading and unloading.  For local variables on
> the JVM stack for each thread, the GC functions are slightly different
> than for fields in an object, but the principle is the same.

You can write the interface so that the GC needs to know when a new class
is loaded (or not, but IMO it's a good design).  As far as the GC is
concerned, a class is alive as long as there are objects of that type in
the heap.  If the class data structures are actually in the heap, this
becomes easy, but if you want to keep them on the VM side of the fence,
you could potentially hijack the weak reference mechanism to get notified
when the last object dies.

---

I suspected that this might be a good idea for classes.  Thanks for
the confirmation.

There should not be any reason to hijack the weak reference
mechanism with this GC interface design as the GC mechanism
is notified when a class is deleted, which can _only_ occur when
there are no references to it.

Maybe I should state something that I consider to be of value to this
JVM design.  Both Robin and Rodrigo have been sniffing around the
edges of it in their critique.  And they both have some _good_ points
about what I have put together.  And I am learning quite a bit as I
think about their issues.

I think this JVM design has some strong intrinsic features in that I
explicitly do _not_ depend on a lot of heap allocation for my major
runtime structures, that is, for those that exist over a significant part
of the life of the JVM, namely the thread, class, and object tables.
In their place, a somewhat static malloc-type allocation (huh?) is done
for THREAD(), CLASS() and OBJECT() structures-- meaning that I do
a single heap allocation for the whole of each table and keep it until
the JVM shuts down.  (These table designs, of course, may be
changed as necessary.)  I do _nothing_ fancy in the way of
managing memory.  Period.  And Intentionally.  This attitude probably
comes from my experience in real-time embedded systems where the
resources are limited and non-extensible.  By applying what I consider
to be _extremely_ conservative memory management tactics, I think
that this design will have some inherent reliability and speed built into
it that may not be obvious upon first blush.  With that said, I am very
interested in the numerous ideas for architectural changes and
improvements, and I look forward to Robin's forthcoming suggestions
for a new API for the GC mechanism.

---


Regards,
Robin






Dan Lydick
