harmony-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steve Blackburn <Steve.Blackb...@anu.edu.au>
Subject Re: Terminology etc
Date Wed, 25 May 2005 07:57:01 GMT
Weldon Washburn wrote:
 > a)
 >
 > Inlining app code inside of app code is basically orthogonal to the
 > language used to write a JIT.

Of course.

 > b)
 > A GC written in C++ can inline stuff inside the GC module without
 > problems.  Also, a JIT written in Java/C# can inline parts of itself
 > with no problems.

Of course.

 > c)
 > It probably does not make a lot of sense to worry about inlining
 > between two VM modules until we get the initial Harmony VM up and we
 > can conduct performance analysis.  For example, inlining the
 > classloader into the JIT may not provide any performance gains.  The
 > downside of over aggressive inlining is code bloat which can impact
 > instruction cache performance and startup times.

Sure.  I only see inter-module inlining as an issue when the modules
are going to have very fine grain interactions.  I can't think of any
examples, so this is not a big concern for me at this stage either.

 > d)
 > I suspect that inlining GC write barriers and allocation sequences in
 > the binary image emitted by the JIT is basically orthogonal to the
 > languages chosen to implement the JIT and GC modules.  Regardless what
 > language is used for developing the GC, an inlinable "bump the
 > pointer" allocation sequence can always quickly be written in Java
 > source code, Java byte code or even the JIT's IR (internal
 > representation).  The same applies to the write barrier.

I agree it is orthogonal to the language chosen to write the JIT. It
is not orthogonal to the language chosen to write the GC.

As you say, you really want the barrier expressed in either Java,
bytecode or IR.  So you could write your GC in C and then specially
code up your barriers and allocation sequences in Java or bytecode
(coding in IR immediately introduces a dependence between your GC and
a particular compiler).  In this case you're no longer coding in C (to
paraphrase Geir ;-).  You're coding in Java|bytecode|IR and C.  Perhaps
the mix of Java and C within a core element of the GC would be
manageable.  But why torture yourself?  More importantly, if your goal
is performance, modularity and flexibility (MMTk supports 8 or so GC
algorithms), then treating barrier implementations as a special case
is needless pain.

A much better example is the allocation sequence for a free list
allocator (a non moving allocator is an important choice in some
settings).  In this case, it is not just a question of a few lines of
Java, because the allocation fast path includes the code for mapping
the allocation size to the size class as well as the free list lookup.
This is no longer trivial.  Now when it is coded in Java, the
optimizing compiler inlines the entire allocation fast path, exploits
the fact that the size of scalar allocations are statically know,
statically evauates the sizeclass lookup, and thus has the sizeclass
index as a constant.  We showed in our ICSE paper that this allowed us
to outperform glibc's malloc(), which implemented essentially the same
allocator in C.  Sure, you could go and code all that up in Java (or
bytecode, or IR) and do the rest of the GC in C, but why?

 > I would like to see a list of between module inlining candiates that
 > are known to cause a performance problems.  In ORP we routinely loaded
 > the JIT and GC as *.dll/so.  I don't recall observing compelling
 > dll-to-dll inlining opportunities.  Otherwise we would have trashed
 > the dll/so and gone back to static linking the entire VM.

As I said above, I don't forsee inter-module calling as a likely
source of performance concerns.  I have advocated right from the
outset and will continue to advocate a framework that allows modules
to be written in either Java or C.

I hope this clarifies some of the issues.

Cheers,

--Steve


  /**
   * Allocate <code>bytes</code> contigious bytes of zeroed memory.<p>
   *
   * This code must be efficient and must compile easily.  Here we
   * minimize the number of calls to inlined functions, and force the
   * "slow path" (uncommon case) out of line to reduce pressure on the
   * compiler.  We have a call to an abstract method that allows
   * subclasses to customize post-allocation.
   *
   * @param bytes The size of the object to occupy this space, in bytes.
   * @param align The requested alignment.
   * @param offset The alignment offset.
   * @param inGC If true, this allocation is occuring with respect to
   * a space that is currently being collected.
   * @return The address of the first word of <code>bytes</code>
   * contigious bytes of zeroed memory.
   */
  public final Address allocFast(int bytes, int align, int offset,
                                 boolean inGC) throws InlinePragma {
    int alignedBytes = getMaximumAlignedSize(bytes, align);
    int sizeClass = getSizeClass(alignedBytes);
    Address cell = freeList.get(sizeClass);
    if (!cell.isZero()) {
      freeList.set(sizeClass, getNextCell(cell));
      setNextCell(cell, Address.zero()); // clear out the free list link
      if (alignedBytes != bytes) {
        // Ensure aligned as requested.
        return alignAllocation(cell, align, offset);
      }
    }

    // Alignment request guaranteed or cell.isZero().
    return cell;
  }


  /**
   * Get the size class for a given number of bytes.<p>
   *
   * We use size classes based on a worst case internal fragmentation
   * loss target of 1/8.  In fact, across sizes from 8 bytes to 512
   * the average worst case loss is 13.3%, giving an expected loss
   * (assuming uniform distribution) of about 7%.  We avoid using the
   * Lea class sizes because they were so numerous and therefore
   * liable to lead to excessive inter-class-size fragmentation.<p>
   *
   * This method may segregate arrays and scalars (currently it does
   * not).
   *
   * This method should be more intelligent and take alignment requests
   * into consideration. The issue with this is that the block header
   * which can be varied by subclasses can change the alignment of the
   * cells.
   *
   * @param bytes The number of bytes required to accommodate the
   * object to be allocated.
   * @return The size class capable of accomodating the allocation
   * request.  If the request is sufficiently large then
   * <code>LARGE_SIZE_CLASS</code> will be returned, indicating that
   * the request will not be satisfied by the freelist, but must be
   * dealt with explicitly as a large object.
   */
  protected static final int getSizeClass(int bytes)
    throws InlinePragma {
    if (Assert.VERIFY_ASSERTIONS) Assert._assert((bytes > 0) && (bytes 
<= 8192));

    int sz1 = bytes - 1;

    if (BYTES_IN_ADDRESS == 32) { //32-bit
      if (COMPACT_SIZE_CLASSES)
        return ((sz1 <= 31) ?      (sz1 >>  2): //    4 bytes apart
              (sz1 <=   63) ?  4 + (sz1 >>  3): //    8 bytes apart
              (sz1 <=   95) ?  8 + (sz1 >>  4): //   16 bytes apart
              (sz1 <=  223) ? 14 + (sz1 >>  6): //   64 bytes apart
              (sz1 <=  734) ? 17 + (sz1 >>  8): //  256 bytes apart
                              20 + (sz1 >> 10));// 1024 bytes apart
      else
        return ((sz1 <=   63) ?    (sz1 >>  2): //    4 bytes apart
              (sz1 <=  127) ? 12 + (sz1 >>  4): //   16 bytes apart
              (sz1 <=  255) ? 16 + (sz1 >>  5): //   32 bytes apart
              (sz1 <=  511) ? 20 + (sz1 >>  6): //   64 bytes apart
              (sz1 <= 2047) ? 26 + (sz1 >>  8): //  256 bytes apart
                              32 + (sz1 >> 10));// 1024 bytes apart
    } else { //64-bit
      if (COMPACT_SIZE_CLASSES)
        return ((sz1 <= 95) ?      (sz1 >>  3): //    8 bytes apart
              (sz1 <=  127) ?  6 + (sz1 >>  4): //   16 bytes apart
              (sz1 <=  191) ? 10 + (sz1 >>  5): //   32 bytes apart
              (sz1 <=  383) ? 13 + (sz1 >>  6): //   64 bytes apart
              (sz1 <=  511) ? 16 + (sz1 >>  7): //  128 bytes apart
              (sz1 <= 1023) ? 19 + (sz1 >>  9): //  512 bytes apart
                              20 + (sz1 >> 10));// 1024 bytes apart
      else
        return ((sz1 <= 111) ?     (sz1 >>  3): //    8 bytes apart
              (sz1 <=  223) ?  7 + (sz1 >>  4): //   16 bytes apart
              (sz1 <=  319) ? 14 + (sz1 >>  5): //   32 bytes apart
              (sz1 <=  575) ? 19 + (sz1 >>  6): //   64 bytes apart
              (sz1 <= 2047) ? 26 + (sz1 >>  8): //  256 bytes apart
                              32 + (sz1 >> 10));// 1024 bytes apart
    }
  }


Mime
View raw message